Data Observability in Practice
Monitoring
Continuous data monitoring lets data teams keep a constant pulse on the health of their data. Imagine a typical data stack: data is ingested from sources, loaded into the warehouse, modeled, and reported on through a BI solution. Continuous data monitoring allows a data team to monitor and manage data incidents across this entire pipeline, at scale. The result is better incident management workflows: faster identification of potential root causes and a clearer understanding of the total downstream impact of an issue.
Compared to standalone unit tests, continuous data monitoring helps your team add coverage at scale. Unit tests are typically run at a point in time to check known knowns, such as ensuring a primary key has zero null values and is 100% unique, whereas continuous data monitors learn the behavior of your data and predict how it should behave in the future. That means monitoring can be added to thousands of assets to surface unknown unknowns, without the risk of alert fatigue. The most effective continuous data monitoring solutions use machine learning to set thresholds automatically, so the complexity and seasonality of your data are incorporated into the monitoring.
Examples of metrics include freshness, row count, and business metrics (e.g., is revenue within a realistic range for our business?).
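As a simplified illustration of what a learned threshold can look like in practice, the sketch below flags a daily row count that falls outside its historical distribution. The function, the z-score cutoff, and the hard-coded history are assumptions for illustration; production monitors use richer models that also account for trend and seasonality.

```python
# A minimal sketch of a "learned threshold" monitor on daily row counts.
# In practice the history would come from warehouse metadata or a monitoring
# store; here it is hard-coded for illustration.
import statistics

def row_count_is_anomalous(history: list, latest: int, z_cutoff: float = 3.0) -> bool:
    """Flag the latest row count if it falls outside mean +/- z_cutoff * stdev."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1.0  # guard against a perfectly flat history
    return abs(latest - mean) / stdev > z_cutoff

history = [10_120, 10_340, 9_980, 10_210, 10_400, 10_150, 10_290]
print(row_count_is_anomalous(history, latest=4_500))   # True: likely a partial load
print(row_count_is_anomalous(history, latest=10_310))  # False: within the learned range
```

A production monitor would refresh this history automatically on every run, which is what lets it scale to thousands of assets without hand-set thresholds.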
Let's consider a scenario where a data quality issue affects revenue attribution for marketing campaigns:
The positive impact of "Campaign A" is overreported (e.g. due to incorrect data from a source system or a modeling change), leading the marketing team to allocate more budget to it while expecting outsized results. Continuous data monitoring would have caught this issue quickly by tracking historical results and alerting when it detected anomalous behavior, preventing wasted spend and letting the team focus on "Campaign B," which would have generated more revenue. This demonstrates one of the advantages of continuous data monitoring: by capturing the nuance and seasonality of past behavior, your team is notified when that behavior changes, without needing to set specific thresholds.
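To make the seasonality point concrete, here is a hedged sketch that compares a day's campaign revenue only against prior values from the same weekday, so a recurring weekly pattern does not trigger false alerts. The metric values and dates are invented for illustration.

```python
# Hedged sketch: a seasonality-aware check on daily campaign revenue. Today's
# value is compared only against prior values from the same weekday, so weekly
# patterns don't raise false alerts. The data is hard-coded for illustration.
import statistics
from datetime import date

def same_weekday_anomaly(history: dict, today: date, value: float, z_cutoff: float = 3.0) -> bool:
    peers = [v for d, v in history.items() if d.weekday() == today.weekday()]
    if len(peers) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(peers)
    stdev = statistics.stdev(peers) or 1.0
    return abs(value - mean) / stdev > z_cutoff

# Revenue that is naturally higher on some weekdays than others.
history = {date(2023, 5, day): 1_000 + (day % 7) * 150 + (day % 5) * 30 for day in range(1, 29)}
print(same_weekday_anomaly(history, date(2023, 5, 29), value=5_400))  # True: overreported
print(same_weekday_anomaly(history, date(2023, 5, 29), value=1_200))  # False: normal Monday
```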
Schema changes
Schema changes are inevitable as products and businesses evolve. They can originate in upstream systems, such as the transactional databases that power products, or within the warehouse, such as modeling changes. Data teams often sort schema changes into two buckets: changes that are simply helpful to be aware of, and changes that can have real downstream impact.
Schema change awareness
Data teams often sit between software engineering teams and data consumers. This is challenging because upstream teams may not always let the data team know about new data or changes to existing data.
Monitoring schema changes on upstream data sources helps create communication and collaboration across software and data engineering teams. For example, simply by monitoring for schema changes in an upstream transactional database like Postgres, a data team can proactively reach out to software teams to learn what changes are being made. This leads to better outcomes, such as improved modeling based on new data or breaking changes to the warehouse caught before they land.
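As a rough sketch of what this kind of upstream monitoring can look like, the snippet below snapshots Postgres's information_schema.columns and diffs it against the previous snapshot to surface added, removed, or re-typed columns. The connection string and the snapshot file are assumptions for illustration.

```python
# Hedged sketch: detect schema changes in an upstream Postgres database by
# diffing snapshots of information_schema.columns. The connection string and
# the file used to store the previous snapshot are illustrative assumptions.
import json
import pathlib
import psycopg2

SNAPSHOT_QUERY = """
    SELECT table_schema, table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
"""

def take_snapshot(dsn: str) -> dict:
    """Map 'schema.table.column' -> data type for every user-defined column."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(SNAPSHOT_QUERY)
        return {f"{s}.{t}.{c}": dtype for s, t, c, dtype in cur.fetchall()}

def diff_snapshots(previous: dict, current: dict) -> dict:
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "retyped": sorted(k for k in previous.keys() & current.keys()
                          if previous[k] != current[k]),
    }

if __name__ == "__main__":
    snapshot_path = pathlib.Path("last_schema_snapshot.json")
    previous = json.loads(snapshot_path.read_text()) if snapshot_path.exists() else {}
    current = take_snapshot("postgresql://user:password@host:5432/app_db")
    print(diff_snapshots(previous, current))  # feed this into an alert or notification
    snapshot_path.write_text(json.dumps(current))
```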
Data quality issues caused by schema changes
In other cases, schema changes can immediately cause data quality issues. These changes can include:
- Source systems adding or removing tables and columns
- Data type changes, such as from a string to a number
- Adding new analytics dimensions, such as a new event that your team wants to track
- Updating naming conventions
- Unintentional modifications caused by automated pipeline tools, such as a re-cast data type
While not all schema changes result in data quality issues, those that affect columns used extensively downstream in models, dashboards, or reports can have a significant impact. Therefore, tracking schema changes, even for tables not directly monitored, is important to identify potential data quality issues early on.
Metaplane offers the capability to track schema changes for all tables, including those without explicit data quality monitors configured. It lets you send notifications about schema changes both to relevant stakeholders, such as those who changed the schema, and to downstream consumers, facilitating communication and bridging the knowledge gap on how schema changes affect downstream data products.
Unit Tests
Unit tests are a critical part of any data quality initiative. Many data teams have data quality unit tests in place to check boolean outcomes. These tests cover requirements such as:
- A primary key is always unique and never null
- A column's <value> is always under X
- A column's <value> is always above 0
Data unit tests are typically run as part of the modeling pipeline. A "failure" response can be used to block the pipeline from continuing to run.
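For illustration, here is a minimal sketch of such a boolean test, using an in-memory SQLite table to stand in for the warehouse. The table and column names are assumptions; the same query could run against any warehouse connection.

```python
# Sketch of a boolean data unit test: the primary key must be unique and
# never null. sqlite3 with an in-memory table stands in for the warehouse;
# in practice the same query would run against your warehouse connection.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 25.0), (2, 40.0), (3, 12.5);
""")

def test_orders_primary_key(conn) -> None:
    null_count, dupe_count = conn.execute("""
        SELECT
            SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END),
            COUNT(order_id) - COUNT(DISTINCT order_id)
        FROM orders
    """).fetchone()
    # A failing assertion here is the "failure" response that blocks the pipeline.
    assert null_count == 0, "orders.order_id contains null values"
    assert dupe_count == 0, "orders.order_id contains duplicate values"

test_orders_primary_key(conn)
print("primary key checks passed")
```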
Data quality monitoring through data observability works against the same data, but differs in a few ways:
- It typically runs continuously rather than depending on pipeline refreshes
- It addresses scalar outcomes, such as whether the percentage of null values falls within its historical range (sketched below)
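Here is a minimal sketch of that kind of scalar check, with the historical null rates and tolerance hard-coded as illustrative assumptions:

```python
# Hedged sketch of a scalar check: flag when the null rate of a column drifts
# outside its historical range, rather than asserting that it is exactly zero.
def null_rate_in_range(historical_rates, current_rate, tolerance=0.5):
    """True if the current null rate sits within the historical min/max, widened by a tolerance."""
    lo = min(historical_rates) * (1 - tolerance)
    hi = max(historical_rates) * (1 + tolerance)
    return lo <= current_rate <= hi

history = [0.021, 0.019, 0.022, 0.020, 0.018, 0.023, 0.021]  # ~2% nulls is normal
print(null_rate_in_range(history, 0.020))  # True: nothing to alert on
print(null_rate_in_range(history, 0.150))  # False: null rate spiked to 15%
```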
We take a deeper dive into unit testing, including suggestions on how to automate tests, in a later section.
Regression tests
Regression tests help data teams identify bugs caused by changes to data models before they are merged into a production environment. They can be run in a separate development environment, integrated into the production workflow, or both. Integrating them into the modeling code change flow using a continuous integration / continuous deployment (CI/CD) service ensures checks run before changes are merged, protecting downstream tables, dashboards, and models from data quality issues.
Using dbt to help run regression tests
Tools such as Metaplane's GitHub app help implement regression testing in CI/CD, catching negative impacts before code changes are merged. For more detail, including a comparison between testing in a development environment and testing in production processes using CI/CD, read on below.
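As one hedged example of what a CI regression check can do, the sketch below compares a model built into a temporary CI schema (as dbt-based CI setups often do) against its production counterpart, and fails the build if row counts drift beyond a tolerance. The schema names, table, tolerance, and connection details are assumptions for illustration.

```python
# Sketch of a CI regression check: compare a model built in a temporary CI
# schema against its production counterpart before the pull request merges.
# A CI/CD job (e.g. a GitHub Actions step) would run this script and fail
# the build on a non-zero exit code. Names and tolerance are assumptions.
import sys
import psycopg2

TOLERANCE = 0.01  # allow a 1% difference in row counts

def row_count(cur, schema: str, table: str) -> int:
    cur.execute(f"SELECT COUNT(*) FROM {schema}.{table}")  # trusted identifiers only
    return cur.fetchone()[0]

def main() -> int:
    with psycopg2.connect("postgresql://user:password@host:5432/analytics") as conn:
        with conn.cursor() as cur:
            prod = row_count(cur, "analytics", "fct_orders")
            dev = row_count(cur, "ci_pr_schema", "fct_orders")
    drift = abs(dev - prod) / max(prod, 1)
    if drift > TOLERANCE:
        print(f"Regression check failed: row count drifted {drift:.1%}")
        return 1
    print("Regression check passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```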
Column-Level Lineage
Visibility into how every field is used in your data stack and pipelines can be beneficial in almost every step of a data incident workflow.
- Preventing future issues: Understanding field usage in tables and reports helps anticipate how updates, such as schema changes, can break queries.
- Understanding impact: Column-level lineage aids in tracking downstream objects and identifying relevant stakeholders when an issue arises. Having this information enables proactive alerts and gives you a channel to get valuable context for issue resolution.
- Resolving issues: To resolve an issue, you'll typically have to trace upstream objects to locate the new or changed process that caused it. Having up-to-date column-level lineage in place accelerates issue triage.
To create lineage diagrams for key objects, you can manually parse SQL and use diagramming tools such as Mermaid or Figma. Alternatively, you can leverage Data Catalog/Discovery vendors or Metaplane, which automatically generates and updates column-level lineage graphs that show data quality issues alongside objects.
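If you want a feel for the manual route, here is a small sketch that parses a query with the open-source sqlglot library to list the tables and columns it references; resolving aliases into a full column-level lineage graph is the part that dedicated tooling automates. The query text is an invented example.

```python
# Hedged sketch: list the tables and columns a query references by parsing
# its SQL with sqlglot. Aliases (e.g. "o", "c") would still need to be
# resolved back to physical tables for true column-level lineage.
import sqlglot
from sqlglot import exp

sql = """
    SELECT o.order_id, o.amount, c.region
    FROM analytics.fct_orders AS o
    JOIN analytics.dim_customers AS c ON o.customer_id = c.customer_id
"""

parsed = sqlglot.parse_one(sql)
tables = sorted({f"{t.db}.{t.name}" if t.db else t.name for t in parsed.find_all(exp.Table)})
columns = sorted({f"{c.table}.{c.name}" if c.table else c.name for c in parsed.find_all(exp.Column)})

print("tables referenced: ", tables)   # ['analytics.dim_customers', 'analytics.fct_orders']
print("columns referenced:", columns)  # e.g. ['c.customer_id', 'c.region', 'o.amount', ...]
```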
Usage Analytics
Data observability solutions leverage metadata to detect issues in data and related processes, or to generate column-level lineage, but they can also use that metadata to understand query patterns. Having this information adds context about how the warehouse is used, can flag potential future issues, and promotes query hygiene. You can use it to:
- Identify key objects: Parsing your queries will help you identify heavily referenced objects, while tracking query frequency provides insight into how often you’re using these objects. By considering both factors, you can exercise caution when modifying key objects to avoid cascading issues throughout your stack.
- Improve warehouse efficiency: In cloud warehouses, longer query runtimes increase compute expenses and can introduce latency in cron-dependent flows. In OLTP databases, prolonged queries may even result in missed transactions and database blocking. While daily runtime checks may not be necessary, periodic assessments are valuable for proactive efficiency improvements and reducing compute resource consumption.
- Improve total cost of ownership (TCO) of your data stack: In data stacks that include cloud data warehouses/lakes and related tools, assigning specific users or groups to distinct processes (e.g., COPY/MERGE, other DML statements) is a common practice. Tracking the queries run by service users helps identify tools that disproportionately contribute to warehouse spend, enabling configuration adjustments to reduce costs (see the usage sketch after this list).
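As a hedged example, the sketch below ranks users (including service users) by total query runtime over the last 30 days, assuming a Snowflake-style ACCOUNT_USAGE.QUERY_HISTORY view. The view columns and connection details are assumptions; other warehouses expose similar query-history metadata.

```python
# Sketch: rank users and tools by how much warehouse runtime they consume,
# using a Snowflake-style ACCOUNT_USAGE.QUERY_HISTORY view. View columns and
# connection details are assumptions for illustration.
import snowflake.connector

USAGE_QUERY = """
    SELECT user_name,
           COUNT(*)                       AS query_count,
           SUM(total_elapsed_time) / 1000 AS total_seconds
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY user_name
    ORDER BY total_seconds DESC
    LIMIT 10
"""

conn = snowflake.connector.connect(account="my_account", user="me", password="...")
for user, query_count, total_seconds in conn.cursor().execute(USAGE_QUERY).fetchall():
    print(f"{user}: {query_count} queries, {total_seconds:.0f}s of runtime")
```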
Metaplane identifies heavily queried objects, shown by icons alongside table names, and provides detailed usage analytics for each table.
It offers insights into the queries executed on each table, users running them, and allocated compute resources for recurring queries.
Things to consider
- Performance of solution:
When extracting data both to set your initial test thresholds and to update those thresholds over time, pay attention to the type of data you use as well as the structure of your query.
For example, when it comes to determining freshness, teams may opt to start with a timestamp column. This works well for small tables, but can lead to degrading query performance as the table grows. We recommend using warehouse metadata when possible.
As the table grows larger, you may also want to use that same timestamp column to set future sampling windows (e.g. where timestamp >= 'last_sampling'), to avoid re-querying the whole table; a minimal sketch of this pattern appears at the end of this section.
In a transactional database, long-running queries may lead to table locks, preventing necessary data from being written. Most data warehouses and lakes can handle concurrent queries without much difficulty, but they typically have internal processes, such as transformations scheduled as a DAG by an orchestration tool, that will suffer from unexpected latency.
- Compute cost: As the performance of a query degrades, its runtime typically grows, which on a cloud-hosted solution also increases cost roughly linearly.
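Putting the performance considerations above together, here is a minimal sketch of the sampling-window pattern: remember the last timestamp you sampled and only scan newer rows on each monitoring run. The table, column, and connection names are assumptions for illustration.

```python
# Sketch of the sampling-window pattern: remember the last timestamp sampled
# and only scan rows newer than it, instead of re-querying the whole table on
# every monitoring run. Table, column, and connection names are assumptions.
import psycopg2

def incremental_stats(dsn: str, last_sampled_at: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT COUNT(*), MAX(updated_at)
            FROM analytics.fct_orders
            WHERE updated_at >= %s
            """,
            (last_sampled_at,),
        )
        new_rows, max_ts = cur.fetchone()
    return new_rows, max_ts

new_rows, max_ts = incremental_stats(
    "postgresql://user:password@host:5432/analytics",
    last_sampled_at="2023-06-01 00:00:00",
)
print(f"{new_rows} new rows since last sample; latest updated_at = {max_ts}")
# Persist max_ts as the next run's last_sampled_at so each run stays incremental.
```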