Data Regression Tests
Whenever possible, it is better to prevent data quality issues than to troubleshoot them after they have occurred. When preventable issues slip through because your team has no process in place to catch them, the result is often a loss of trust across the organization, both in the team and in the data itself. One common and preventable source of data quality issues is code changes made to models.
One approach to testing the impact of model changes on data quality is to clone your production environment's tables, run your new modeling query against the clone, and then query the output to check whether the results fall within your expected range(s). The issues with this approach are:
- The technical debt of maintaining your development environment for the lifetime of your change(s), including replicating tables and surrounding processes (e.g. update cadence)
- Limited understanding of downstream objects outside your development environment (e.g. a business intelligence dashboard)
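As a rough illustration of the clone-and-check approach described above, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table names (prod_orders, dev_orders, dev_order_summary), the sample rows, and the expected range of 10 to 100 are all hypothetical; in practice you would point this at your warehouse and use ranges that reflect your own data.

```python
import sqlite3


def metric_in_range(conn, query, low, high):
    """Run a query that returns one value and check it against [low, high]."""
    (value,) = conn.execute(query).fetchone()
    return value, low <= value <= high


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for a production table (hypothetical data).
    CREATE TABLE prod_orders (id INTEGER, amount REAL);
    INSERT INTO prod_orders VALUES (1, 20.0), (2, 35.0), (3, 50.0);

    -- Clone the production table into a development copy.
    CREATE TABLE dev_orders AS SELECT * FROM prod_orders;

    -- The new modeling query, run against the clone only.
    CREATE TABLE dev_order_summary AS
        SELECT COUNT(*) AS order_count, AVG(amount) AS avg_amount
        FROM dev_orders;
    """
)

# Expected range is an assumption chosen for this example.
avg_amount, ok = metric_in_range(
    conn, "SELECT avg_amount FROM dev_order_summary", low=10.0, high=100.0
)
print(f"avg_amount={avg_amount}, within expected range: {ok}")
```

Note that the clone here is a one-time snapshot, which is exactly the maintenance burden listed above: it does not follow production's update cadence unless you rebuild it.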
If you’re able to run regression testing within your code deployment process (i.e. CI/CD process) for your models, you’ll be able to get a much more accurate view of potential data quality issues that will occur as a result of your code changes.
Unfortunately, creating a CI/CD process is not trivial: it requires development and infrastructure resources to implement. If you do have the resources to invest in incorporating data quality testing, here are some ideas on how to test changes made in this workflow.
Running Regression Tests
- Think about how the data is being used. If you have common queries, share them with teammates so that they are run against development data and the results are compared to production
- If you have a set of important dashboards, include them as part of your testing before merging a pull request. If you are changing a column name, leverage column-level lineage to ensure your downstream dashboards won’t break
- Start simple, and work towards automating these checks. Using tools like dbt can help because there is metadata that can be used to infer where you should be running tests, and what may be downstream of your changes.
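One simple starting point for the first idea above is a small script that runs each shared query against both environments and flags large differences. The following is a minimal sketch, not a definitive implementation: it assumes development and production tables live in the same database under hypothetical dev_ and prod_ prefixes, uses sqlite3 as a stand-in for a warehouse, and treats a 5% relative difference as the cutoff.

```python
import sqlite3


def run_scalar(conn, sql):
    """Return the single value produced by a one-row, one-column query."""
    return conn.execute(sql).fetchone()[0]


def relative_drift(prod_value, dev_value):
    """Relative difference between dev and prod, guarding against prod == 0."""
    if prod_value == 0:
        return 0.0 if dev_value == 0 else float("inf")
    return abs(dev_value - prod_value) / abs(prod_value)


def compare_to_production(conn, checks, rel_tol=0.05):
    """Run each shared query in both environments and collect the failures.

    `checks` maps a check name to a SQL template with an {env} placeholder.
    """
    failures = []
    for name, template in checks.items():
        prod_value = run_scalar(conn, template.format(env="prod"))
        dev_value = run_scalar(conn, template.format(env="dev"))
        if relative_drift(prod_value, dev_value) > rel_tol:
            failures.append((name, prod_value, dev_value))
    return failures


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE prod_revenue (day TEXT, amount REAL);
    INSERT INTO prod_revenue VALUES ('d1', 100.0), ('d2', 200.0), ('d3', 300.0);
    CREATE TABLE dev_revenue (day TEXT, amount REAL);
    INSERT INTO dev_revenue VALUES ('d1', 100.0), ('d2', 200.0), ('d3', 360.0);
    """
)

# Hypothetical "common queries" shared across the team.
COMMON_QUERIES = {
    "row_count": "SELECT COUNT(*) FROM {env}_revenue",
    "total_revenue": "SELECT SUM(amount) FROM {env}_revenue",
}

failures = compare_to_production(conn, COMMON_QUERIES)
for name, prod_value, dev_value in failures:
    print(f"{name}: prod={prod_value} dev={dev_value}")
```

Keeping the queries in one shared mapping means every teammate runs the same checks, and the same script can later be called from an automated pipeline.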
Data CI/CD
To implement regression tests alongside code in production environments, integrate them into the workflow you use to change modeling code. As data teams adopt engineering development workflows, changes are often made in an environment like GitHub, which enables continuous integration/deployment (CI/CD) tooling to perform checks before changes are merged.
Regression testing within the CI/CD process should not only identify impacted downstream tables and business intelligence dashboards, but also forecast changes in the values generated by related models. You can create your own tests with fixed thresholds or use Metaplane's GitHub app, which lists dependencies so that you can prevent negative downstream impacts before merging changes.
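If you build your own check, identifying impacted downstream objects comes down to walking a lineage graph from the changed model. The sketch below assumes you already have lineage available as a mapping from each object to its direct children; the LINEAGE contents, the object names, and the "dashboard:" prefix are hypothetical and stand in for whatever metadata your tooling exposes. It does not represent how any particular vendor's app works.

```python
from collections import deque


def downstream_of(lineage, changed):
    """Breadth-first walk returning every object downstream of `changed`."""
    seen = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Hypothetical lineage: each key maps to its direct downstream objects.
LINEAGE = {
    "stg_orders": ["fct_orders"],
    "fct_orders": ["revenue_daily", "dashboard:sales_overview"],
    "revenue_daily": ["dashboard:finance_kpis"],
    "stg_customers": ["dim_customers"],
}

impacted = downstream_of(LINEAGE, "stg_orders")
impacted_dashboards = [name for name in impacted if name.startswith("dashboard:")]

print("Impacted objects:", impacted)
print("Impacted dashboards:", impacted_dashboards)
```

In a CI job, a list like this could be posted on the pull request so reviewers see what is affected before merging, and the impacted tables are natural targets for fixed-threshold tests.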