How to maintain data integrity
Here are a few considerations to keep in mind as you continue to refine your approach to data integrity.
“Why does the data need to be right?”
This was a question an account executive posed to me back when I was their sales engineer at a now-defunct Google Ads bidding optimization SaaS company. At the time, I struggled to answer without resorting to the phrase “because it does.” As I progressed in my career and was lucky enough to work more and more closely with data teams, the question stuck with me, because it’s a far more nuanced conversation than it appears at face value.
For one, if you’re not using the data, then my account executive was right - it doesn’t need to be right, as sacrilegious as that may sound. On the other hand, the definition of data integrity, and the processes to maintain it, will differ depending on your particular organization.
Understanding Data Integrity
“Data integrity” is one more term used to describe data quality, which Kevin, CEO of Metaplane, breaks down further into intrinsic and extrinsic dimensions. As a quick recap, the flexible definition of data integrity can be summarized as “what qualities of data matter to you?” For the purpose of alignment over the course of this blog post, let’s loosely define data integrity as:
- The values are accurate.
- The data is whole and fresh.
- The format (i.e., schema) is what we expect.
Contributing Factors To Data Accuracy
When you think about the true start of your data pipeline - everything from data generated by a source (e.g. an event pixel) through to a data product (e.g. a business intelligence dashboard, if you need a mental placeholder) - every successive phase of data movement could introduce a data quality incident.
To better visualize data integrity controls, we’ll think about a full stack analytics engineer who’s focused on marketing analytics. This means that they’re ingesting from email campaign sources and digital ads, and will surface insights in a business intelligence dashboard created for the marketing team.
Data Entry and Collection
Let’s take an email campaign tool being used as a source. Data integrity can be impacted from the very outset if a campaign name, later used in a conditional statement in a WHERE clause, is entered inaccurately. But let’s not blame our marketing team for all potential issues - there could just as easily be issues with data integrity at collection points, for example, a misfire of the pixels capturing related user activity on the account.
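To make that concrete, here’s a minimal sketch of how a single mistyped campaign name can silently drop rows downstream. The table and column names (email_events, campaign_name) are illustrative, not from any particular schema.

```sql
-- Illustrative only: the campaign was created as 'Q1_Promo_2024' in the
-- email tool, but the model filters on the name the analyst expected.
SELECT
    campaign_name,
    COUNT(*) AS email_opens
FROM email_events
WHERE campaign_name = 'Q1 Promo 2024'  -- spaces vs. underscores: zero rows match
GROUP BY campaign_name;

-- The query succeeds with no error, so the dashboard quietly shows no opens
-- for the campaign instead of surfacing a data quality incident.
```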
If there are humans involved, there’s always a chance for errors to be accidentally introduced. Placing safeguards in the “applications” - or wherever manual data entry takes place - would be ideal, but data teams rarely have the permissions to enforce rules in every upstream tool. As a result, checking for data integrity at the entry points into your data storage is the most common approach here.
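One lightweight way to do this is an “exception” query that runs right after data lands and flags values that break an agreed-upon convention. This is only a sketch; the staging table, timestamp column, and naming rule are assumptions for illustration.

```sql
-- Flag newly loaded campaign names that break the agreed convention
-- (e.g. all lowercase, underscores instead of spaces).
SELECT
    campaign_name,
    COUNT(*) AS offending_rows
FROM staging.email_events
WHERE loaded_at >= CURRENT_DATE
  AND (campaign_name <> LOWER(campaign_name) OR campaign_name LIKE '% %')
GROUP BY campaign_name;

-- A non-empty result can fail the pipeline step or page the data team
-- before bad names propagate into downstream models.
```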
Data Storage and Management
Depending on the tool(s) that you’re using to store your data, you may have additional transaction and storage considerations. For example, if you’re using an older database optimized for transactions, you may run the risk of accidentally locking a table in the midst of an analytics query, causing a dropped transaction.
In another scenario, you may be partitioning or sharding tables, causing user confusion when they query for the full dataset and only receive partial results. Understanding how your data is stored, in the context of your high-level infrastructure, helps ensure data integrity.
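As a sketch of that second scenario, imagine the ad events are stored as date-sharded tables named ad_events_YYYYMMDD (a common BigQuery pattern); all names here are hypothetical.

```sql
-- What the analyst ran: a single daily shard, so totals look suspiciously low.
SELECT COUNT(*) AS total_clicks
FROM `marketing.ad_events_20240101`;

-- What they meant: a wildcard query across every daily shard.
SELECT COUNT(*) AS total_clicks
FROM `marketing.ad_events_*`;
```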
Backups and Recovery
Loosely related to sharding, and up for consideration in your data integrity strategy, are data backups and recovery. There are two processes you’ll probably see this reflected in:
- Disaster Recovery - This is statistically unlikely to happen often, but your hardware may be wiped out by a natural disaster and data may need to be migrated to another server.
- Recurring drop and replace - Luckily, we’re seeing these types of processes less and less as storage becomes cheaper, but some companies may still opt to reload data into specific table(s) or schema(s) nightly.
In both cases, cloud users will likely have some sort of backup in place, just by merit of the standard offerings from major cloud providers, but it’s good practice to note the recovery time to full operational capacity.
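If your warehouse supports it, a cheap complement to provider-managed backups is snapshotting critical tables right before a destructive reload. Here’s a minimal sketch assuming a Snowflake-style zero-copy clone; the table names are illustrative.

```sql
-- Take an inexpensive point-in-time copy before the nightly drop-and-replace.
CREATE TABLE analytics.email_events_backup_20240101
    CLONE analytics.email_events;

-- If the reload goes wrong, recovery is a rename rather than a lengthy restore.
ALTER TABLE analytics.email_events
    RENAME TO analytics.email_events_broken;
ALTER TABLE analytics.email_events_backup_20240101
    RENAME TO analytics.email_events;
```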
Security Measures for Data Integrity
When it comes to data integrity, “security” considerations can primarily be summed up by “access”. Similar to the potential for human error in data entry in business applications, if a user has write access to a dataset or a modeling query, they might introduce a data quality incident.
Imagine someone’s requested your permissions so that they can run their own ad-hoc queries. You trust them, and they know SQL, so you copy over your permissions. However, in the process of “fixing” raw data, they mistakenly edit a table materialized by a dbt model instead of the source itself, meaning that the “bad” data is still bad everywhere outside of this one model - and their manual fix will be overwritten on the next run - so data integrity was not maintained. To minimize the chance of this happening, code reviews and team collaboration should be part of your security stance to protect your data, even from yourself.
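A lower-risk alternative to copying over your own permissions is granting a scoped, read-only role for ad-hoc work. This is a sketch in Snowflake-flavored SQL; the role, database, schema, and user names are made up for illustration.

```sql
-- Create a role that can read, but never write, the analytics schema.
CREATE ROLE IF NOT EXISTS adhoc_reader;
GRANT USAGE ON DATABASE analytics_db TO ROLE adhoc_reader;
GRANT USAGE ON SCHEMA analytics_db.marts TO ROLE adhoc_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics_db.marts TO ROLE adhoc_reader;
GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics_db.marts TO ROLE adhoc_reader;

-- Hand the role to the colleague instead of your own, broader permissions.
GRANT ROLE adhoc_reader TO USER curious_colleague;
```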
Regular Monitoring and Auditing
While unit tests run intraday and/or with every pipeline build can be useful for finding issues at a given point in time, regular, automated monitoring protects data integrity at an even more advanced level. Catching data quality issues as far upstream as possible in your pipeline naturally mitigates the downstream ripple effects.
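A simple starting point, before reaching for a dedicated tool, is a scheduled query that compares a table’s freshness and volume against what you’d expect. This is a minimal sketch; the table, timestamp column, and thresholds are assumptions.

```sql
-- Run on a schedule: alert if the marketing events table looks stale or thin.
SELECT
    MAX(loaded_at) AS last_load,
    COUNT(*)       AS rows_last_24h,
    CASE
        WHEN MAX(loaded_at) < CURRENT_TIMESTAMP - INTERVAL '6 hours' THEN 'STALE'
        WHEN COUNT(*) < 1000                                         THEN 'LOW VOLUME'
        ELSE 'OK'
    END AS status
FROM staging.email_events
WHERE loaded_at >= CURRENT_TIMESTAMP - INTERVAL '24 hours';
```

Hand-tuned thresholds like these are exactly where dedicated observability tooling takes over, learning what “normal” looks like instead of relying on static numbers.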
In addition to having monitors, it’s also important to audit your monitoring strategy over time as your object lists and business requirements change. These sorts of audits should begin with the data team but extend to approvals from the rest of the business, so that you can ensure you have the correct coverage for all of the new tools the marketing team is introducing in the new year.
Data Integrity Best Practices To Follow
While the definition of data integrity and high-quality data will vary from team to team, there are a few universal best practices you can adhere to in the interest of maintaining data integrity.
- Outline what data quality means to you and your stakeholders, so that you know what to measure and track.
- Be aware of historical causes of data integrity issues, whether that’s through data entry or collection. Incident retrospectives will be useful here for forward planning.
- Understand how your tools and processes behave and note where improper usage leads to data quality issues. This includes user governance controls!
- Use continuous monitoring to scan your data, and re-evaluate your monitoring strategy over time.
Benefits of Using a Data Observability Tool like Metaplane
Metaplane solves for the fourth point above to ensure everyone trusts the data that powers your business. Metaplane’s best-in-class data observability platform blankets your warehouse with monitors to catch data quality issues when they happen, and tools to help you prevent incidents before they occur. And, best of all, getting set up is as easy as flipping a switch.
Get started with Metaplane today!
The most common path to getting started with Metaplane is creating a free account. Setup consists primarily of granting access to integrate with your data stack and selecting a few data quality monitor placements. After that, our machine learning models will alert you to any data quality incidents for as long as monitoring is enabled.
While you don’t need to talk to anyone to set up an account, our team is always eager to share how others have succeeded with Metaplane - we’ll meet you whenever you’re available!