Does Every Organization Need a Data Dictionary? Here’s Why the Answer is Yes
Have you ever faced delays because tech and non-tech departments misunderstood each other? Do different teams in your organization use inconsistent data terminology? What about data privacy and compliance — are you confident your organization is handling sensitive data in accordance with regulations?
Do you want to improve this but don’t know how? Well, a data dictionary will help ease your pain.
Read on to learn what a data dictionary is, how it can save time and money, and how to build one.
What is a data dictionary?
A data dictionary is a centralized document that systematically documents and defines every field in your data, including those containing personally identifiable information (PII). Any information about a field can be recorded, such as its name, type, constraints, size, and owner. Errors, redundancies, and inconsistencies are easier to find when the information about the data is structured. Better tracking leads to better management and, in turn, to improved data quality.
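To make this concrete, here is a minimal sketch of what a single entry could look like if you kept the dictionary in code. The class name `FieldEntry`, its attributes, and the sample tables are all my own illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldEntry:
    """One row of a data dictionary (attribute names are illustrative)."""
    table: str
    name: str
    data_type: str
    definition: str
    owner: str
    constraints: List[str] = field(default_factory=list)
    size: Optional[int] = None
    contains_pii: bool = False


# Hypothetical sample entries.
entries = [
    FieldEntry(
        table="customers",
        name="email",
        data_type="varchar",
        definition="Primary contact email address",
        owner="CRM team",
        constraints=["not null", "unique"],
        size=255,
        contains_pii=True,
    ),
    FieldEntry(
        table="orders",
        name="total_amount",
        data_type="decimal",
        definition="Order total including tax",
        owner="Billing team",
        constraints=["not null"],
    ),
]

# Because the entries are structured, simple questions become one-liners.
pii_fields = [f"{e.table}.{e.name}" for e in entries if e.contains_pii]
print(pii_fields)  # ['customers.email']
```

The same columns work just as well as headers in a spreadsheet; the point is only that every field gets the same set of attributes.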
How can a data dictionary help my organization?
Building a Knowledge Base (Personal Opinion)
There are people who seemingly have all the information in their heads. In today’s world, they probably rely on a combination of tools, documentation, and collaboration to manage and access the information needed to do their tasks. Relying on a few individuals for critical information is risky because it creates bottlenecks, leads to potential knowledge gaps when those individuals are unavailable, and hinders the overall efficiency and resilience of the organization.
I have been working with data for 8 years now, and in my experience a data dictionary speeds up onboarding and reduces unnecessary communication. Every time I assist a new organization, the pattern is almost the same: there is very little documentation, and it often explains things only at a very high level. When I need to understand how a field is used without having access to the code base, there is nothing else to do but look at the existing queries and ask around if things are not straightforward. Typically, I start by examining the queries, then ask one person who leads me to another, and that person often directs me to the owner of the feature. Sometimes each person gives a completely different answer; other times no one knows the answer.
If there were a data dictionary, I would not have to waste so much time gathering information from so many different people. I am sure they also have more important tasks to focus on. This is exactly why I find a centralized data dictionary so valuable.
Reducing friction
A data dictionary can help reduce friction between people. Often, people in an organization collect information from each other and trust the person who provided that information to be correct. If that information turns out to be incorrect, people start pointing fingers at each other, saying, “They gave me the wrong information; they wanted to sabotage me.” With a data dictionary, there is no such issue. Everyone can point their finger at the data dictionary.
Source of truth (SoT) and efficient decision making
Knowing the source of truth for each data point is important for many reasons. If teams do not know the source of truth, they might plan based on wrong information. They might come up with feature ideas that are fundamentally wrong and impossible to implement, and then blame the tech team for failing to deliver. In the data dictionary, you can mark whether each field originates from a third party, so everyone knows where the data comes from and collaboration improves.
Detecting Inconsistencies, Errors, Redundancies
With a data dictionary, it is easier to detect redundancies, inconsistencies, and maybe even potential errors, because the fields are structured and anomalies are easier to spot and mark. It also provides feedback to the tech team, highlighting any redundancies they may have introduced and helping them identify overlooked issues.
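As a small illustration of how structure makes anomalies easy to spot, the sketch below flags column names that are documented with more than one data type. The sample rows and the function name `find_type_conflicts` are hypothetical; a real check would read from wherever your dictionary lives.

```python
from collections import defaultdict

# Hypothetical dictionary rows: (table, column, data_type).
columns = [
    ("customers", "customer_id", "integer"),
    ("orders", "customer_id", "varchar"),
    ("orders", "created_at", "timestamp"),
    ("invoices", "created_at", "timestamp"),
]


def find_type_conflicts(columns):
    """Return column names that are documented with more than one data type."""
    types_by_name = defaultdict(set)
    for _table, name, data_type in columns:
        types_by_name[name].add(data_type)
    return {
        name: sorted(types)
        for name, types in types_by_name.items()
        if len(types) > 1
    }


conflicts = find_type_conflicts(columns)
print(conflicts)  # {'customer_id': ['integer', 'varchar']}
```

A conflict like this is not always a bug, but it is a good question to bring to the tech team.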
Reducing Misunderstandings and Miscommunication
I have had far too many extra meetings just because people had different understandings of the same thing. This confusion led to more meetings and eventually involved someone from the tech team. When people don’t share the same knowledge, they may plan and build on top of inaccurate foundations, which can financially harm the organization.
Simplifying Compliance
With a data dictionary, it is much easier to find the fields that contain PII and to plan a data audit. You can systematically document and define each field, including those containing PII, with attributes such as:
- Sensitive Data Indicator: A flag indicating whether the data in a particular column is considered sensitive or contains personally identifiable information (PII).
- Data Masking: Any data masking techniques applied to protect sensitive information (e.g., masking of credit card numbers, replacing real names with pseudonyms).
- Access Controls: Information on who has access to view or modify the data in each column.
- Data Retention Policy: Details on how long the data is retained and when it should be deleted or anonymized.
- Legal/Regulatory Compliance: Any legal or regulatory requirements governing the collection, storage, and use of the data.
- Consent Requirements: Information on any consent requirements for collecting or using the data, especially if it involves personal data.
- Data Encryption: Whether the data is encrypted to protect it from unauthorized access.
- Audit Trail: Whether changes to the data are logged and auditable to ensure compliance and accountability.
- Privacy Impact Assessment (PIA): Documentation of any privacy impact assessments conducted for the data.
- Data Sharing Agreements: Details of any agreements or contracts governing the sharing of the data with third parties.
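Once a few of these attributes are recorded per field, preparing for an audit can start with a simple filter. The sketch below lists sensitive fields that are missing masking, encryption, or a retention policy. The key names (`sensitive`, `masking`, `encrypted`, `retention_days`) and the sample fields are assumptions I made for the example, and which gaps actually matter depends on your own policies and regulations.

```python
# Hypothetical compliance attributes for three fields.
fields = [
    {"field": "customers.email", "sensitive": True,
     "masking": "partial", "encrypted": True, "retention_days": 730},
    {"field": "customers.birth_date", "sensitive": True,
     "masking": None, "encrypted": False, "retention_days": None},
    {"field": "orders.total_amount", "sensitive": False,
     "masking": None, "encrypted": False, "retention_days": 3650},
]


def audit_gaps(fields):
    """Map each sensitive field to the protections that are not documented."""
    gaps = {}
    for entry in fields:
        if not entry["sensitive"]:
            continue  # only sensitive fields are checked in this sketch
        missing = []
        if entry["masking"] is None:
            missing.append("masking")
        if not entry["encrypted"]:
            missing.append("encryption")
        if entry["retention_days"] is None:
            missing.append("retention policy")
        if missing:
            gaps[entry["field"]] = missing
    return gaps


gaps = audit_gaps(fields)
print(gaps)
# {'customers.birth_date': ['masking', 'encryption', 'retention policy']}
```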
How do I build my data dictionary?
Clarifying the Purpose
The first step is to clarify your goals for the data dictionary and to have clear answers to these questions:
- What problem do you want to fix with the data dictionary?
- How will a data dictionary be used in your organization?
- Who will be responsible for maintaining this new resource?
Finding Data Stewards
Next, you need data stewards who are responsible for managing and overseeing the data dictionary. The job is similar to housekeeping: recurring checks and cleanups are needed to create a reliable resource for the organization. This will result in a consistent and accurate data dictionary that people can use confidently.
You can allow anyone to edit the data dictionary, but don’t be afraid to limit edit rights and hand the data steward responsibility to someone else if the current editors are not up to your standards.
Your data stewards need to be people who are organized and responsible.
Setting Standards and Defining the Basic Structure
Define naming conventions, formats, units of measurement and data types. If you don’t define your preferences, your data stewards will, and their choices might not align with what the majority in your organization prefers.
Gather the important data elements you want to track and start documenting them: tables, columns, data types, definitions, constraints, example data, allowable values, default values, etc. You can also define relationships, size and other attributes, but that can quickly become overwhelming at the start. Create a simple starting point: begin with an example table and give it to the data stewards. It does not have to be perfect, because the data stewards will be happy to correct it.
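Standards are easier to keep when they can be checked automatically. Here is a minimal sketch that validates an entry against two example rules: column names in snake_case and data types from an agreed list. Both rules, the `ALLOWED_TYPES` set, and the function name `check_entry` are assumptions for illustration; substitute whatever conventions your organization settles on.

```python
import re

# Example conventions (assumptions, not a recommendation).
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
ALLOWED_TYPES = {"integer", "decimal", "varchar", "boolean", "date", "timestamp"}


def check_entry(column_name, data_type):
    """Return a list of human-readable problems; empty if the entry conforms."""
    problems = []
    if not SNAKE_CASE.match(column_name):
        problems.append(f"name '{column_name}' is not snake_case")
    if data_type not in ALLOWED_TYPES:
        problems.append(f"type '{data_type}' is not in the agreed type list")
    return problems


print(check_entry("created_at", "timestamp"))  # []
print(check_entry("CreatedAt", "datetime"))
# ["name 'CreatedAt' is not snake_case", "type 'datetime' is not in the agreed type list"]
```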
Choosing a Tool
The chosen tool will depend on the budget and the scale of the project, as well as the maturity of the data stewards. For small projects you don’t have to invest extra money; you can start in Excel or Google Sheets. For larger projects there are database documentation tools like DBDoc that will do part of the work for you if you import your database.
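If you go the spreadsheet route, a starter sheet can be generated as CSV, which both Excel and Google Sheets can open. This is only a sketch: the column headers and sample rows are hypothetical, and it writes to an in-memory buffer so it runs anywhere (pass a real file handle instead to save it to disk).

```python
import csv
import io

# Hypothetical starter columns and rows for the sheet.
HEADERS = ["table", "column", "data_type", "definition"]
rows = [
    {"table": "customers", "column": "email",
     "data_type": "varchar", "definition": "Primary contact email address"},
    {"table": "orders", "column": "total_amount",
     "data_type": "decimal", "definition": "Order total including tax"},
]

# Write to an in-memory buffer; use open("data_dictionary.csv", "w", newline="")
# instead to create a file you can import into a spreadsheet.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=HEADERS)
writer.writeheader()
writer.writerows(rows)

csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])  # table,column,data_type,definition
```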
Monthly Reviews
For me, the best way is to update the data dictionary whenever you encounter something you didn’t know before, or before changes are deployed. Check whether the information you figured out already exists and whether it is still correct, and make updates if necessary. If you see a partial entry, dig a bit deeper if you have the time.
Implement a process to track changes over time. Review the data dictionary monthly just to see if something stands out, like typos or wrong data types. When you update an entry, reach out to the person who last edited it and let them know the reason for the update.
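A monthly review is easier when you can list which entries have not been looked at recently. The sketch below assumes each entry stores a `last_reviewed` date and a `last_edited_by` name; those keys, the 30-day threshold, and the sample data are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical review metadata stored alongside each entry.
entries = [
    {"field": "customers.email", "last_reviewed": date(2024, 5, 20),
     "last_edited_by": "anna"},
    {"field": "orders.status", "last_reviewed": date(2024, 1, 8),
     "last_edited_by": "ben"},
]


def due_for_review(entries, today, max_age_days=30):
    """Return entries whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [e for e in entries if e["last_reviewed"] < cutoff]


# A fixed date keeps the example reproducible; use date.today() in practice.
stale = due_for_review(entries, today=date(2024, 6, 1))
for entry in stale:
    print(f"{entry['field']} (last edited by {entry['last_edited_by']})")
# orders.status (last edited by ben)
```

Keeping the last editor next to the date also tells you who to contact when you change an entry.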
Feedback Loop
Create a Slack channel where users of the data dictionary can report issues or suggest improvements. This can also help identify potential data stewards among those who actively participate.
Security
Even if it’s not the actual data but just the data structure, this information is still very valuable and shouldn’t fall into the wrong hands. Be responsible when granting permission to update it.