Deployment best practices
Getting started
Freshness and row count monitors
The first monitors to deploy are freshness and row count monitors. Freshness monitors track whether a table has received ingestion at its usual frequency, including when an update that should have occurred by now hasn’t arrived. Row count monitors assess completeness: whether the table has gained as many new rows as expected, adapting to the low and high volumes typical of different times of day, week, month, quarter, or year. Together, these monitors give you a bird’s-eye view of the execution health of your pipeline.
There are a number of benefits to these monitors:
- They don’t require selecting a column to monitor, so they can be rolled out in bulk from a database or schema level
- They pull from your data warehouse’s metadata (information schema or table info depending on your warehouse), so they reduce cloud usage (more info about Snowflake’s implementation available here)
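As a rough illustration, here is the kind of metadata these monitors draw on, sketched as a query against Snowflake’s information schema (the database and schema names are placeholders; other warehouses expose similar table info):

```sql
-- Minimal sketch: freshness and row count signals pulled straight from
-- warehouse metadata, so no table scan is required. Snowflake syntax;
-- names are illustrative.
SELECT
    table_name,
    row_count,     -- total rows tracked by the metadata layer
    last_altered   -- most recent modification to the table
FROM my_database.information_schema.tables
WHERE table_schema = 'ANALYTICS'
ORDER BY last_altered DESC;
```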
Due to their minimal compute requirements, these monitors can be deployed broadly across every table in your environment, but we recommend a more surgical approach. Metaplane provides downstream dependency and query frequency icons that identify key tables to cover with monitors. Monitoring these ensures that the alerts you receive concern tables with dependencies you care about. Use this approach to cover the pipeline from where data lands all the way to end-user-facing tables, giving you visibility into pipeline breakages and related outages when an incident occurs.
Freshness and row count monitors on views
These monitors are available for covering views in addition to tables. However, the metadata available on views differs from that of tables, so please be sure to take into account the following:
- Row count monitors will scan the view rather than drawing that data from the metadata. If you find these are running slowly, add a WHERE clause or Time window to reduce the size of the query to only the data of interest.
- The metadata for a view only provides the time the view was created, so to monitor freshness, click into the view itself and add a freshness monitor to a timestamp column.
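For example, a column-level freshness check on a view can be sketched as the query below, assuming a hypothetical orders_view with an updated_at timestamp column; the time-window filter keeps the scan small, per the row count advice above:

```sql
-- Sketch of a freshness check on a view via a timestamp column.
-- DATEADD is Snowflake-style; the view and column names are placeholders.
SELECT MAX(updated_at) AS most_recent_record
FROM analytics.orders_view
WHERE updated_at >= DATEADD(day, -7, CURRENT_TIMESTAMP());
```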
Data quality checks
With row count and freshness monitors checking your pipeline for process breakdowns, the next step is to deploy monitors for quality control of the data entering your stack. Row count and freshness monitors tell you how much data arrived and when, but they don’t dig into what was in that data. For that, Metaplane offers a number of additional standard monitors and monitoring tools.
The best data quality monitor for complementing your row count and freshness deployment depends on the data source you’re ingesting. Below are a few scenarios and best practices for each.
Data ingested from applications / platforms
Engineers frequently work with application data, often storing it in PostgreSQL or MySQL and modifying the data on the fly. They may add, remove, or rename columns, tables, or whole schemas, and they may adjust what format data is stored in.
To make sure changes in application data don’t disrupt your downstream tables and users, roll out column count monitors on the tables that pose a risk. Customize your schema change monitoring to alert you to any changes in the data types or columns in the parts of the platform that are consumed downstream, while ignoring data that engineers can adjust harmlessly.
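Under the hood, a column count check amounts to something like the following sketch against the warehouse’s information schema (the database, schema, and table names are placeholders):

```sql
-- Sketch: the value a column count monitor tracks is the number of
-- columns currently defined on the table.
SELECT COUNT(*) AS column_count
FROM my_database.information_schema.columns
WHERE table_schema = 'APP'
  AND table_name = 'USERS';
```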
Source data with manual inputs
If you’re ingesting data from spreadsheets maintained by users, protect your environment from human error with value range monitors. For numeric data, a mean monitor will watch for data drift, and minimum and maximum monitors can catch when a value is entered incorrectly. Use a cardinality monitor to make sure enum data stays within the values you expect, with an alert firing when a new value appears.
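As a sketch, these checks reduce to simple aggregates; the table and columns below are placeholders for manually maintained data:

```sql
-- Mean for drift, min/max for fat-fingered entries, and distinct count
-- for unexpected enum values.
SELECT
    AVG(amount)            AS mean_amount,
    MIN(amount)            AS min_amount,
    MAX(amount)            AS max_amount,
    COUNT(DISTINCT status) AS status_cardinality
FROM raw.spreadsheet_upload;
```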
Data from other software tools
When sourcing data from other software tools, the best monitors for the job will depend heavily on the kind of system you’re connecting.
- Marketing software, for example, frequently tracks user interactions across multiple channels, such as LinkedIn and Facebook. If there are problems reconciling those users across platforms after ingestion into your warehouse, you may end up with duplicate rows, which pose a risk to joins. Use nullness and uniqueness monitors to make sure the data you’re ingesting isn’t missing key values or bloated with bad data (see the sketch after this list).
- Other tools may use automated routines to grab data from the internet, but these are vulnerable to changes in web design that can lead to ingesting empty or erroneous values. Nullness monitors alert you when that happens so you can act on the change.
- Use a freshness monitor on a timestamp column to complement a table-level freshness monitor. If the table-level monitor is fine but the column-level monitor fires, it’s likely that data is successfully ingesting, but the data is stale.
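Here is a sketch of the duplicate detection a uniqueness monitor performs on ingested marketing data; the table and user_id key are placeholders:

```sql
-- Any rows returned indicate duplicate user records that would bloat joins.
SELECT user_id, COUNT(*) AS occurrences
FROM raw.marketing_contacts
GROUP BY user_id
HAVING COUNT(*) > 1;
```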
Custom SQL monitors
Custom SQL monitors are great tools for capturing other types of data quality challenges. These monitors allow you to define the query that Metaplane runs and the data point that it gathers. With these monitors, you can write joins or CTEs into the query, use a standard function on a combination of columns, or write a complex calculation or cross-table comparison.
Custom SQL can be useful for performing data quality checks that arise occasionally but are too taxing to investigate manually. Combine a custom SQL monitor with a manual threshold (as opposed to automatic anomaly detection) to encode a data quality standard such as “Alert when this monitor returns >0”. Using custom SQL, have it return the count of rows that violate an expectation (a sketch of the first check follows this list):
- timestamps that don’t match an expected format (checked with a regex)
- transactions completed by known “bad actors” identified by past events
- values in one table that don’t match their equivalents in another table (checked with a join)
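A sketch of the first check, assuming a hypothetical raw.events table whose event_ts_raw column stores timestamps as strings (REGEXP_LIKE is Snowflake-style; adjust for your warehouse). Pair it with a manual threshold that alerts when the result is greater than 0:

```sql
-- Count rows whose timestamp string doesn't match an ISO-8601-like format.
SELECT COUNT(*) AS violations
FROM raw.events
WHERE NOT REGEXP_LIKE(event_ts_raw, '^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}');
```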
Custom SQL best practices
- Custom SQL monitors were created to detect a broader range of data patterns than the standard monitors. If the monitor is an advanced use case of a standard monitor (for example, measuring the uniqueness of a compound key, sketched below), set the “Model” to match the model for that aggregation.
- If the same CTE or join is used to produce different values to monitor, consider materializing the data as a view or table and deploying a suite of standard monitors.
- You can load any monitor with values from before its creation to enhance its predictive range (this isn’t limited to custom SQL, but you may need to run SQL in your data warehouse to retrieve the historical data). The values can be loaded via the API or as a CSV file.
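The compound-key case mentioned above might look like this sketch, with illustrative table and key names:

```sql
-- Returns the number of (order_id, line_item_id) pairs that appear more
-- than once; a value above 0 means the compound key isn't unique.
SELECT COUNT(*) AS duplicate_keys
FROM (
    SELECT order_id, line_item_id
    FROM analytics.order_lines
    GROUP BY order_id, line_item_id
    HAVING COUNT(*) > 1
) AS dupes;
```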
Transformations
Data observability on your transformations protects you from data drift over time and from new edge cases not covered by your existing scripts. The monitors to deploy are the same as those covered above, applied to the data your transformation scripts act on.
Your pipeline likely has jobs that regularly clean up the source data to prepare it for downstream consumption. Perhaps that source data has an acceptable volume of nulls and duplicates, so a script removes or transforms that data before it’s used downstream. Use a nullness or uniqueness monitor to ensure that the 2% of a column that’s usually null doesn’t creep up to 8%, or that the uniqueness of a value doesn’t deteriorate and cause bloated joins.
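The null-rate check described above can be sketched as a percentage, using a placeholder staging table and email column:

```sql
-- Share of the column that is null; a nullness monitor watches this value
-- so a usual ~2% doesn't creep toward 8%.
SELECT 100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*)
       AS pct_null
FROM staging.customers;
```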
The same goes for casting types, whether you’re consolidating to a consistent timestamp format, standardizing on a numeric grain, or categorizing values into a set of possible enums. A cardinality monitor will protect you when new values outside your enums are introduced, and custom SQL monitors can make sure the rows that fail to match your expected formats don’t grow as a percentage of the source data.
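A sketch of that custom SQL check, with a placeholder table and an assumed set of accepted enum values:

```sql
-- Share of source rows that fall outside the enums the transformation
-- expects; growth in this percentage means new unhandled values.
SELECT 100.0 * SUM(CASE WHEN status IS NULL
                          OR status NOT IN ('open', 'closed', 'pending')
                        THEN 1 ELSE 0 END) / COUNT(*) AS pct_unmatched
FROM raw.tickets;
```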
Monitors on the intermediate or transformation layer are a secondary opportunity to catch issues that start upstream. An alert in the landing or transformation layer can help identify root causes and how far the problem extends into other tables in the same layer.