Data Quality Begins and Ends Outside of the Analytics Team
The people traditionally most concerned with data quality are, naturally, the people debugging data issues themselves: analytics teams. Conversations around data quality have focused on data pipeline tests and anomaly detection. But what about ensuring data quality both upstream and downstream of analytics workflows?
Analytics teams own data flows, but not the underlying data. Nor do they own the downstream tools that data moves into (outside of business intelligence dashboards). Marketing, sales, and e-commerce teams are in a similar position: they build custom user experiences using data sets they don't own.
Consider the classic modern data stack: Fivetran or Airbyte pumps app data into Snowflake, transformations run in dbt, and Census or Hightouch pushes data back into the source apps. As a concrete e-commerce example, say Fivetran pulls data from Shopify on product views, carts, and checkouts. Transformations then build user marketing segments that propagate to Google and Facebook audiences.
Now something changes. Fivetran sends out an email about a Shopify integration schema change, the one lonely analytics engineer misses it because they're swamped, and the marketing audiences are never turned off. Or the name of the product-viewed event changes at the commerce team's request, but that change never propagates downstream. Or a column name in a dbt model changes. The point is that a change can rarely be made in a vacuum, yet this is often forgotten, with the good intention of moving quickly and delivering. The consequence is both lost trust and incorrect reporting to decision makers.
Everyone should be concerned about quality, because the point of data is to enable other parts of the organization to act on it. What does data quality mean and achieve outside the scope of analytics?
Building trust from analytics and beyond
Analytics teams establish data trust with stakeholders by putting safeguards in place so that, if a bug is accidentally deployed, someone knows about it. For instance, a fixed data integrity test might ensure product views don't drop to zero for every user on a given day, since that would indicate a data generation or transformation bug.
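As a rough sketch of what that kind of check boils down to, here it is in plain Python, assuming a hypothetical `daily_product_views` result pulled from the warehouse (the table shape and numbers are made up for illustration):

```python
from datetime import date

# Hypothetical rows pulled from the warehouse: one count of product views per day.
# In practice this would come from a query against the transformed Shopify data.
daily_product_views = [
    {"day": date(2022, 6, 1), "view_count": 18432},
    {"day": date(2022, 6, 2), "view_count": 17901},
    {"day": date(2022, 6, 3), "view_count": 0},  # suspicious: views dropped to zero
]

def days_with_zero_views(rows):
    """Return any day where product views dropped to zero for everyone."""
    return [row["day"] for row in rows if row["view_count"] == 0]

bad_days = days_with_zero_views(daily_product_views)
if bad_days:
    # A real pipeline would fail the job or page someone here.
    raise ValueError(f"Product views dropped to zero on: {bad_days}")
```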
Software engineering teams have testing frameworks that get them closer to bug-free production deployments, including unit testing (like pytest) and observability tooling (like New Relic). The equivalent frameworks and cloud tools that let analytics teams run automated tests are only now catching up.
However, the testing I've described is single-point: checking through data profiling that data meets certain dimensions of quality, like the percentage of rows with nulls or the uniqueness of a primary key, or its freshness (i.e. timeliness).
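Here's a minimal Python sketch of those single-point checks over a hypothetical user table; the column names and the 24-hour freshness window are assumptions for the example, and in dbt the first two would typically be expressed as not_null and unique schema tests:

```python
from datetime import datetime, timedelta

# Hypothetical user-level rows, standing in for a warehouse table.
rows = [
    {"user_id": "u1", "email": "a@example.com", "loaded_at": datetime(2022, 6, 3, 8, 0)},
    {"user_id": "u2", "email": None,            "loaded_at": datetime(2022, 6, 3, 8, 0)},
    {"user_id": "u2", "email": "b@example.com", "loaded_at": datetime(2022, 6, 1, 8, 0)},
]

now = datetime(2022, 6, 3, 12, 0)  # in practice: datetime.utcnow()

# Dimension 1: completeness -- percent of rows with a null email.
null_pct = sum(r["email"] is None for r in rows) / len(rows) * 100

# Dimension 2: uniqueness -- is user_id really a primary key?
is_unique = len({r["user_id"] for r in rows}) == len(rows)

# Dimension 3: freshness -- has anything landed in the last 24 hours?
is_fresh = max(r["loaded_at"] for r in rows) > now - timedelta(hours=24)

print(f"null %: {null_pct:.1f}, unique primary key: {is_unique}, fresh: {is_fresh}")
```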
Software engineering teams, by contrast, make heavy use of API contracts: in short, documentation and agreements about how data should be made available in data sources and how to interact with API endpoints. Beyond the data stack, there's no such contractual obligation to be maintained.
Sure, we can test how data flows through our dbt models. But even if all of those tests pass, who's to say the marketing team is informed about the changes and their effects on marketing audiences?
There are data governance tools on the technical side that help build these contracts across teams, most notably Avo, which aims to ease communication and improve data quality across software engineering, analytics, machine learning, and product teams. But this type of contractual obligation shouldn't be specific to technical teams. Why not maintain the circle of trust between analytics and marketing, sales, commerce, and finance with more communication?
Quality matters in all departments with different implications
Let's revisit that marketing audience example: Shopify data flows into a warehouse, is transformed at the user level, and produces a marketing audience that's pushed to Facebook and Google.
If you ask an analytics engineer how they would ensure data quality, they might say dbt tests, since that's a technology already in use with a testing framework readily available. But what if we change the dbt models and the tests still pass? Does data quality end with the analytics team? What if marketing changes the name of the audience by accident, so it no longer populates?
When changes (whether purposeful or accidental) aren't caught upstream or downstream, the result is valuable company dollars spent showing ads to marketing audiences that may be stale or inaccurate.
I'd argue the implications of data quality issues beyond analytics teams are far greater: they result in tangible losses.
One person could call it a quality issue, another a miscommunication, and a third a simple accident; the terminology isn't important. A change made in one place and not reflected in another, related place will have consequences.
Quality for business teams means understanding how any change may impact existing workflows directly in operational tools. In our example, that means ensuring marketing audiences are fresh and accurate. Here, I would interpret accuracy to mean that the same expectations placed on the dbt models hold true for the raw marketing audience data, with some additional expectations the marketing team might layer on top.
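To picture what such a check could look like, here's a hedged Python sketch that validates a synced audience from the destination side. Both helpers and the audience name are hypothetical stand-ins (one for a query against the dbt-built audience model, one for an ad platform lookup), since the real integration depends on each team's tools:

```python
from datetime import datetime, timedelta

def expected_audience_from_warehouse(audience_name):
    """Hypothetical stand-in for a query against the dbt-built audience model."""
    return {"name": audience_name, "size": 12500}

def audience_in_ad_platform(audience_name):
    """Hypothetical stand-in for an ad-platform lookup of the synced audience."""
    return {"name": audience_name, "size": 11900, "last_synced": datetime(2022, 6, 3, 7, 30)}

def check_audience(audience_name, now, max_staleness_hours=24, max_size_drift=0.10):
    expected = expected_audience_from_warehouse(audience_name)
    actual = audience_in_ad_platform(audience_name)

    problems = []
    if actual is None:
        return ["audience not found downstream (was it renamed?)"]
    # Freshness: has the audience synced recently?
    if now - actual["last_synced"] > timedelta(hours=max_staleness_hours):
        problems.append("audience is stale")
    # Accuracy: does the downstream size roughly match what the warehouse expects?
    drift = abs(actual["size"] - expected["size"]) / expected["size"]
    if drift > max_size_drift:
        problems.append(f"audience size drifted by {drift:.0%}")
    return problems

issues = check_audience("high_intent_shoppers", now=datetime(2022, 6, 3, 12, 0))
print(issues or "audience looks fresh and accurate")
```

The staleness and drift thresholds here are exactly the kind of expectations a marketing team could layer on top of whatever the analytics team already tests.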
We're at an exciting time: defining accuracy outside of data testing is wide open, and it depends largely on the data quality tools available to each team to actually evaluate accuracy easily.
Holistic data quality yields quality products and user experiences
I'll return to where I started: everyone should be concerned about quality. Ultimately, every employee should care about the end user. The more safeguards and checks there are, the more confident you can be that the end user is getting the best possible experience of your product.
Integration points are always the trickiest to get right.
The puzzle isn’t done until all the pieces are placed. Similarly, data trust and quality aren’t complete until everyone can do their job consistently and well. All teams should be able to sleep at night knowing emails were sent out formatted correctly, ads were shown to the most relevant populations, and data was flowing without a hitch.
Have any other thoughts on holistic data quality? I’m happy to chat on Twitter or LinkedIn.