Data Unit Tests
Data Unit Test Use Cases
Most teams we’ve talked to already have some form of data quality unit tests set up, and ask whether we recommend migrating everything to Metaplane to save on test maintenance. The answer depends on whether the outcomes of those tests are boolean or scalar. Some examples of boolean outcomes are:
- Every id is unique
- An order amount is never above $10,000 (an example of a department constraint)
- No null values
In these cases, you won’t need to update your tests and thresholds often, which makes a machine learning based approach to threshold maintenance less valuable. The only maintenance work left is tracking schema changes (to ensure your queries still run against the intended fields or objects) and maintaining the automation architecture that runs the tests.
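As a minimal sketch of what boolean tests like these can look like, the snippet below writes each check as a SQL query that counts violating rows, so a count of zero means the test passes. It uses Python’s built-in `sqlite3` as a stand-in for a warehouse. The `orders` table, its columns, and the helper and test names are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(1, 120.0), (2, 9500.0), (3, 42.5)],
)


def count_violations(conn, sql):
    """Run a query that returns a single count of rows breaking a rule."""
    return conn.execute(sql).fetchone()[0]


# Each test counts the rows that break the rule; 0 violations means "pass".
BOOLEAN_TESTS = {
    "id_is_unique": (
        "SELECT COUNT(*) FROM ("
        "SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1"
        ") AS dupes"
    ),
    "amount_at_most_10000": "SELECT COUNT(*) FROM orders WHERE amount > 10000",
    "no_null_ids": "SELECT COUNT(*) FROM orders WHERE id IS NULL",
}

# Collapse each violation count into a boolean outcome.
results = {
    name: count_violations(conn, sql) == 0
    for name, sql in BOOLEAN_TESTS.items()
}
print(results)
```

Because the pass condition is fixed (zero violating rows), nothing here needs retuning as the data grows. Only a schema change, such as a renamed `amount` column, would force an update to the queries.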
Data Unit Test Automation
When data is infrequently updated (e.g. on a monthly or annual basis), it’s possible to simply run your test queries by hand. When data is actively being used in production, it’s much more efficient to automate those tests to ensure data quality.
Two common ways that we’ve seen teams manage boolean unit test automation are:
- Implementing dbt tests alongside dbt models: This approach requires that you’re using dbt, but offers the benefit of running predefined tests alongside dbt models. With dbt tests, you can reuse your test types and failure triggers across multiple pipelines.
- Implementing unit tests as part of your DAG: This approach doesn’t require vendor-specific tooling, but does require a function to call repeated tests (e.g. stored procedures) and an orchestration tool (e.g. Dagster), so that you can ensure that your tests are run either when data is updated or on a recurring basis.
In both cases, deciding on an alerting mechanism is crucial. dbt tests provide breach counts and can save failed rows to a designated schema table. SQL-based generic unit tests typically store results within the data warehouse. From there, you’ll need to decide how alert notifications get triggered; the essential choices are when they fire and who receives them.
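To make the "store results, then alert" flow more concrete, here is a rough sketch of a generic test runner that an orchestrated step could call after each load. It records every test’s failing-row count in a results table, then builds one alert per failing test. It again uses `sqlite3` as a stand-in for a warehouse. The `test_results` table, the function names, the test names, and the recipient addresses are all assumptions made for this example, and actually sending the notification (email, Slack, and so on) is left out.

```python
import sqlite3
from datetime import datetime, timezone


def run_and_store(conn, tests):
    """Run each test query and record its failing-row count in test_results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS test_results "
        "(test_name TEXT, failed_rows INTEGER, run_at TEXT)"
    )
    run_at = datetime.now(timezone.utc).isoformat()
    for name, sql in tests.items():
        failed_rows = conn.execute(sql).fetchone()[0]
        conn.execute(
            "INSERT INTO test_results VALUES (?, ?, ?)",
            (name, failed_rows, run_at),
        )
    return run_at


def build_alerts(conn, run_at, recipients):
    """Return one alert per failing test in the given run."""
    rows = conn.execute(
        "SELECT test_name, failed_rows FROM test_results "
        "WHERE run_at = ? AND failed_rows > 0 ORDER BY test_name",
        (run_at,),
    ).fetchall()
    return [
        {
            # Fall back to a default recipient when a test has no owner.
            "to": recipients.get(name, recipients["default"]),
            "message": f"{name} failed on {failed} row(s)",
        }
        for name, failed in rows
    ]


# Hypothetical data: one null amount and one amount above the limit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 50.0), (2, None), (3, 12000.0)],
)

tests = {
    "amount_not_null": "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    "amount_at_most_10000": "SELECT COUNT(*) FROM orders WHERE amount > 10000",
    "id_not_null": "SELECT COUNT(*) FROM orders WHERE id IS NULL",
}
recipients = {
    "default": "data-team@example.com",
    "amount_at_most_10000": "finance@example.com",
}

run_at = run_and_store(conn, tests)
alerts = build_alerts(conn, run_at, recipients)
for alert in alerts:
    print(alert)
```

Keeping every run in the results table, including passing tests, gives you a history to audit later. The recipient mapping is one simple way to make the "who gets notified" decision explicit per test.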
The next question for data teams, then, becomes "What are signs that I should adopt a continuous data monitoring solution?"
- Maintenance times are growing: This can occur due to factors like an improperly implemented DAG or the need for scalar outcome testing instead of boolean outcomes. A proxy for this is your work queue: are you spending more than 10% of your time managing tests, rather than working on revenue-driving projects?
- Test deployment backlog: Deploying tests on a new data product or schema is often just the beginning. Over time, downstream data consumers may discover new issues as their use cases evolve and test the pipeline differently. If you find yourself spending excessive time in scoping meetings and creating/validating tests, we recommend exploring how to create simple predictive models or evaluating vendors.
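For scalar outcomes, a "simple predictive model" can be as small as an expected range derived from recent history. The sketch below flags a new value when it falls outside the mean plus or minus a few standard deviations of past values. This is only one illustrative baseline, not a description of how Metaplane or any other vendor models data. The function names, the `z` parameter, and the sample row counts are hypothetical.

```python
from statistics import mean, stdev


def expected_range(history, z=3.0):
    """Return (low, high) bounds: mean +/- z sample standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - z * spread, center + z * spread


def is_anomalous(history, new_value, z=3.0):
    """True when new_value falls outside the range expected from history."""
    low, high = expected_range(history, z)
    return not (low <= new_value <= high)


# Hypothetical daily row counts for a table, for illustration only.
daily_row_counts = [1000, 1020, 980, 1010, 995, 1005, 990]

print(is_anomalous(daily_row_counts, 1008))  # within the usual range
print(is_anomalous(daily_row_counts, 400))   # far below the usual range
```

Unlike a boolean test, the threshold here moves with the data, which is exactly the maintenance burden that grows over time. A baseline like this ignores trend and seasonality, so it is a starting point for comparison rather than a finished monitor.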
Unit Tests for Data Issue Prevention
If you have the engineering bandwidth, once you’ve implemented your unit tests, your team should also determine how to run these tests before any change to your modeling queries is deployed, so that you proactively prevent incidents rather than simply capture them. We refer to this as regression testing.
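One way to picture this, as a sketch under stated assumptions: build the proposed version of a model as a scratch view, run the existing boolean tests against it, and only promote the change if nothing fails. The example uses `sqlite3` as a stand-in for a warehouse. The `candidate_model` view, the `regression_failures` helper, and the tables and queries are hypothetical.

```python
import sqlite3


def regression_failures(conn, model_sql, tests):
    """Build the model as a scratch view, run each test, return failures."""
    conn.execute("DROP VIEW IF EXISTS candidate_model")
    conn.execute(f"CREATE VIEW candidate_model AS {model_sql}")
    failures = {}
    for name, sql in tests.items():
        failed_rows = conn.execute(sql).fetchone()[0]
        if failed_rows > 0:
            failures[name] = failed_rows
    return failures


# Hypothetical source tables: customer 10 has two addresses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE addresses (customer_id INTEGER, city TEXT);
    INSERT INTO orders VALUES (1, 10, 50.0), (2, 11, 75.0);
    INSERT INTO addresses VALUES (10, 'Boston'), (10, 'Austin'), (11, 'Denver');
""")

# Existing tests, written against the scratch view instead of production.
tests = {
    "order_id_is_unique": (
        "SELECT COUNT(*) FROM ("
        "SELECT order_id FROM candidate_model "
        "GROUP BY order_id HAVING COUNT(*) > 1"
        ") AS dupes"
    ),
    "amount_not_null": "SELECT COUNT(*) FROM candidate_model WHERE amount IS NULL",
}

current_sql = "SELECT order_id AS order_id, amount AS amount FROM orders"
# Proposed change: the join fans out orders for customers with many addresses.
proposed_sql = (
    "SELECT o.order_id AS order_id, o.amount AS amount, a.city AS city "
    "FROM orders o JOIN addresses a ON a.customer_id = o.customer_id"
)

print(regression_failures(conn, current_sql, tests))
print(regression_failures(conn, proposed_sql, tests))
```

Here the proposed join duplicates an order, so the uniqueness test fails before the change ever reaches production. In practice this kind of check would typically run in CI or a staging environment against a representative sample of data.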