Continuous Data Monitoring
Unlike ad-hoc tests, where data is validated manually, continuous data monitoring automates testing and alerts on anomalies or deviations from expectations, letting you capture incidents throughout a pipeline’s lifetime.
In the following sections, we explore popular types of data quality metrics to monitor.
Volume and freshness monitors
Two of the most common issues caused by silent data bugs are inconsistent data volumes and delayed data. When either of these problems occurs, your data consumers may lose trust in the data after making decisions on stale or incomplete information.
Teams often set up freshness and volume monitors as leading indicators of issues on key raw data tables (i.e. "Bronze" tables) and on modeled tables used in Reverse ETL or BI dashboards (i.e. "Gold" tables). By leveraging metadata instead of scanning the tables directly, these monitors can run efficiently with minimal compute resources in most transactional databases and warehouses.
Metaplane can automatically deploy freshness and volume monitors across your entire warehouse, as well as suggest where these monitors should be placed on specific tables, based on query history. We’ll also ensure that these monitors’ thresholds stay updated to reflect any subtle changes in your loading or transformation processes.
Sampling queries for freshness and volume
Here are a few examples of how you can query for when a table was last updated (i.e. freshness) and row counts (i.e. volume) in BigQuery, Snowflake, and Redshift. If you’re building your own observability solution, you can write these as distinct entries into a table and/or cloud storage bucket to track the history of updates.
Snowflake
```sql
SELECT table_name, row_count, last_altered
FROM information_schema.tables
WHERE table_schema = 'your_schema';
```
BigQuery
```sql
-- BigQuery's INFORMATION_SCHEMA.TABLES does not expose row counts or
-- last-modified timestamps; the per-dataset __TABLES__ metadata view does
SELECT
  project_id,
  dataset_id,
  table_id AS table_name,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified,
  row_count
FROM
  `project_id.dataset_id.__TABLES__`;
```
Redshift
In Redshift, there’s no system table that users can easily query to list last-updated timestamps for all tables within a database. One approach is to use your query history; depending on how you write the query, it can consume a significant amount of compute resources.
```sql
-- freshness: recent queries that touched the table, most recent first
SELECT query_history.query_id,
       query_history.start_time,
       query_history.end_time,
       query_history.query_text
FROM SYS_QUERY_HISTORY AS query_history
JOIN SVV_ALL_TABLES AS schema_table_mapping
  ON query_history.database_name = schema_table_mapping.database_name
WHERE query_history.database_name = '<your_db_name>'
  AND schema_table_mapping.table_name = '<your_table>'
  AND query_history.query_text ILIKE '%<your_table>%'
ORDER BY query_history.start_time DESC;
```
Row count sampling in Redshift is much more straightforward for a point-in-time measurement, though you’ll need to save results periodically to see how your row counts trend over time.
```sql
SELECT "database",
       "schema",
       "table",
       tbl_rows
FROM SVV_TABLE_INFO;
```
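If you’re building your own solution, one way to track the history of updates mentioned earlier is to append each snapshot to a small history table. Here’s a minimal sketch using Snowflake-style syntax; the monitoring.table_metrics_history table and its columns are hypothetical names used for illustration.
```sql
-- Hypothetical history table for freshness and volume snapshots
CREATE TABLE IF NOT EXISTS monitoring.table_metrics_history (
    captured_at   TIMESTAMP,
    table_name    VARCHAR,
    row_count     BIGINT,
    last_altered  TIMESTAMP
);

-- Append a point-in-time snapshot of table metadata
INSERT INTO monitoring.table_metrics_history
SELECT CURRENT_TIMESTAMP, table_name, row_count, last_altered
FROM information_schema.tables
WHERE table_schema = 'your_schema';
```
Run this on a regular schedule and your anomaly detection can compare each new snapshot against the accumulated history.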
Data quality monitors
In addition to freshness and volume, there are many more data quality metrics that you can establish by querying values within objects directly. The types of tests you choose will depend heavily on factors unique to each object, such as:
- What data types exist (e.g. STRING, INT, etc)
- What the data is being used for (e.g. revenue reporting, marketing budgeting)
We recommend placing these monitors on key reporting or analytics tables, which are typically modeled tables that curate and combine data from multiple raw tables.
Metaplane simplifies test deployment to a few clicks, using machine learning to establish and maintain acceptable thresholds, which reduces the need for deep domain expertise.
Here are sampling query examples for several common test types, which you can tweak to fit your metric of choice and the data types involved:
Mean (can be adjusted for standard deviation)
```sql
-- mean; note that COUNT(*) includes NULL rows, whereas AVG(<column_name>) ignores them
SELECT SUM(<column_name>) / COUNT(*) AS mean_value
FROM <table>;
```
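As the heading above notes, this can be adjusted to measure spread rather than central tendency; Snowflake, BigQuery, and Redshift all provide a standard deviation built-in:
```sql
SELECT STDDEV(<column_name>)
FROM <table>;
```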
Minimum / Max
```sql
SELECT MIN(<column_name>), MAX(<column_name>)
FROM <table>;
```
Percent Unique
```sql
-- distinct values as a percentage of total rows
SELECT
  COUNT(DISTINCT your_column) * 100.0 / COUNT(*) AS percent_unique
FROM your_table;
-- for a per-value frequency breakdown instead, GROUP BY your_column
-- and divide each group's COUNT(*) by the total row count
```
Percent Null
```sql
-- COUNT(your_column) excludes NULLs, so its difference from COUNT(*) is the NULL count
SELECT
  (COUNT(*) - COUNT(your_column)) * 100.0 / COUNT(*) AS percent_null
FROM your_table;
```
Customizing your data quality monitors
In some cases, you may want to constrain your data quality model training to particular segments of your data. There are two common situations where this happens:
- Recent data has higher value: In high-growth businesses, data profiles across your warehouse are constantly changing. For instance, for a daily in-store sales report, training the model on data from the past week rather than the entire history more accurately reflects week-to-week sales. To monitor this, the sampling queries feeding your data quality models should focus on a rolling time window, typically accomplished with a WHERE clause.
- Only particular dimension(s) matter: For example, imagine you want to monitor annual recurring revenue from North America, where transactions live in a transactions_table and regions live in a store_mapping table. Rather than monitoring the two raw tables separately, we’d recommend monitoring the modeled reporting table that joins transactions_table with store_mapping.
To gauge your data profile, add WHERE clauses to your own sampling queries where applicable, as in the sketch below. Alternatively, you can configure Metaplane's monitors to focus on specific segments within your table(s), eliminating the need for your own modeling and machine learning to establish acceptable thresholds for each test.
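For instance, a segment-constrained sampling query might look like the sketch below; the daily_sales table and its sale_date and region columns are hypothetical, and exact date functions vary slightly by warehouse:
```sql
-- Row counts per day over a rolling 7-day window, restricted to one segment
SELECT DATE_TRUNC('day', sale_date) AS sale_day,
       COUNT(*) AS row_count
FROM daily_sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '7 day'
  AND region = 'North America'
GROUP BY 1
ORDER BY 1;
```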
Group By monitors
In cases where you have an important metric that applies across distinct dimensions of your data, you’ll want to independently train monitors for that metric on each of those dimensions. Let’s take a look at a few analytics examples to highlight when this would happen:
- Product analytics: Imagine you’re tracking the number of times a feature was used. In a product with numerous features at varying levels of adoption, it’d be beneficial to group usage by feature, since each feature operates on its own scale.
- Customer analytics: When tracking customer activity along with context about their payment plan, you’ll also want to group by customer, usually via an ID field.
- Sales analytics: Imagine you’re tracking sales across each region, where sales volume in Massachusetts is dramatically different from sales volume in California. This would be another scenario where it’d make sense to group your sales numbers by your region field.
- Marketing analytics: Finally, imagine that you’re tracking leads and spend by different marketing channels, such as Facebook Ads and Google Ads. In this case, you’d want to group your lead and spend fields by your “channel” column in your “marketing_channel” table.
- Data providers: If your business model is to sell curated datasets or to customize a software application experience for specific users, it’d be useful to group data quality metrics by customer and/or cohorts of customers (e.g. by pricing plan).
If you’re writing sampling queries yourself to feed the machine learning models that generate acceptable quality metric thresholds, the query itself is fairly straightforward:
```sql
-- include the grouped dimensions in the SELECT so each metric maps back to its group
SELECT <dimension_1>, <dimension_2>, <dimension_n>, <metric_calc>
FROM <table>
GROUP BY <dimension_1>, <dimension_2>, <dimension_n>
HAVING <your_constraints>;
```
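As a concrete (hypothetical) illustration of the sales analytics example above, where regional_sales, region, sold_at, and sale_amount are placeholder names:
```sql
-- Daily sales totals per region, so each region's metric is modeled on its own scale
SELECT region,
       DATE_TRUNC('day', sold_at) AS sale_day,
       SUM(sale_amount) AS total_sales
FROM regional_sales
GROUP BY region, DATE_TRUNC('day', sold_at)
HAVING COUNT(*) >= 10  -- optionally skip sparse groups that would produce noisy thresholds
ORDER BY region, sale_day;
```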
Within Metaplane, you can set these Group By statements and apply monitor types in a few clicks, without needing to write SQL or maintain machine learning models yourself.
Source-to-target / Reconciliation monitors
Data warehouses today typically have a variety of ingestion pipelines feeding data in. For example, companies will usually prioritize their Customer Relationship Management (CRM) software, product database, and Enterprise Resource Planning (ERP) data for early analytics use cases, all potentially hosted on different systems with dedicated pipelines.
When creating or updating those ingestion pipelines, two common validations are comparing row counts and ensuring that relationships between objects were preserved.
These validations are easiest to automate in a database-to-database (or warehouse/lake) setup, where you can run row count queries against both source and target and compare the results. Examples (with slight differences depending on database type) include:
- Row Count (run in both Source and Target before comparison):
```sql
-- column names and availability vary slightly by database (see the examples above)
SELECT table_name, row_count FROM information_schema.tables;
```
- Referential Integrity (run in both Source and Target before comparison):
```sql
-- rows in table_2 whose foreign_key has no matching key in table_1
SELECT <row_id>
FROM table_2
WHERE NOT EXISTS (
    SELECT 1
    FROM table_1
    WHERE table_1.key = table_2.foreign_key
);
```
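Once both sides’ counts are collected, the comparison itself is a simple join. Here’s a minimal sketch, assuming the results have been landed into hypothetical source_counts and target_counts staging tables:
```sql
-- Flag tables whose row counts differ between source and target
SELECT s.table_name,
       s.row_count AS source_rows,
       t.row_count AS target_rows,
       s.row_count - t.row_count AS difference
FROM source_counts AS s
JOIN target_counts AS t
  ON s.table_name = t.table_name
WHERE s.row_count <> t.row_count;
```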
Manual Thresholds
For seasoned data teams and mature organizations, you may already know permanent or semi-permanent rules for your data’s bounds. In these cases, it makes sense to simply create monitors that check those rules, with examples being:
- Numbers that should never be negative: There are a variety of numbers that should never be below 0, with examples being revenue or user count.
- A known real-world constraint: An example would be a daily budget that your business stakeholder is held to, which your “spend” column also needs to adhere to.
- Established cron jobs: You may have a 3rd party vendor that sends you flat files for ingestion once daily. In this case, your data quality test only needs to validate that a table has been updated once per day, at a specified time.
It’s best to use manual, hard-coded thresholds only on those constraints that you know to be unchanging or slowly changing, to reduce the maintenance time needed to update those thresholds.
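For example, a hard-coded check for the first case above can be as simple as counting violating rows; the orders table and revenue column here are hypothetical placeholders:
```sql
-- Revenue should never be negative; any violating rows should trigger an alert
SELECT COUNT(*) AS violating_rows
FROM orders
WHERE revenue < 0;
```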
Machine Learning Based Thresholds
Adding tests to your data can be intimidating due to the sheer scope of potential tests and the domain knowledge they require; to help, here’s a common process:
1. Meet with data consumers to confirm known types of tests and acceptable thresholds for those tests. (Note: you may need to gauge this yourself with your own sampling queries.)
2. Confirm that current data profiles fit within your given thresholds, often with ad-hoc queries.
3. Automate your queries (e.g. as stored procedures, or schedule them within your DAG, as sketched below).
4. Repeat steps 1-3 as new issues are found and your business changes.
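As one way to automate step 3, here’s a minimal sketch using a Snowflake scheduled task; the warehouse name is a placeholder, and monitoring.table_metrics_history refers to the hypothetical history table sketched earlier:
```sql
-- Snapshot table metadata into the history table every morning at 06:00 UTC
CREATE OR REPLACE TASK monitoring.daily_table_metrics_snapshot
  WAREHOUSE = your_warehouse
  SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
  INSERT INTO monitoring.table_metrics_history
  SELECT CURRENT_TIMESTAMP, table_name, row_count, last_altered
  FROM information_schema.tables
  WHERE table_schema = 'your_schema';

-- Tasks are created suspended; resume to start the schedule
ALTER TASK monitoring.daily_table_metrics_snapshot RESUME;
```
Equivalent scheduling exists in other warehouses and orchestrators (e.g. an Airflow or dbt job); the point is that the sampling queries run without manual effort.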
Managing tests and thresholds can become cumbersome when dealing with large warehouses, numerous tables, and multiple models that are used daily. As use cases grow, the maintenance of tests and thresholds increases.
Metaplane addresses this with a machine learning-based approach that continually retrains assumptions with your updated data. This ensures thresholds accurately represent data quality issues, reducing alert fatigue and negative business impact.