Getting started with Metaplane
The following is a walkthrough of how to monitor your stack with Metaplane. Every company’s data architecture is different, but by following the best practices below, you can implement robust data observability and build trust in your data by proactively identifying and fixing data quality issues.
This guide is meant as a reference document, and the Metaplane team is on hand to provide more advanced recommendations and guidance should you have any questions. Don’t hesitate to open a support ticket in the app or contact us directly.
Getting started with Metaplane and data observability
What is a monitor?
At their simplest, monitors collect a numeric value and verify that it falls within a predicted range (e.g. this table has 500 rows but it should have 1,000). When the value is outside of the expected range, a data incident is created and an alert is sent to any of several destinations.
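To make this concrete, here is a minimal, hypothetical sketch of the idea in Python; it is not Metaplane’s implementation, and the names and thresholds are illustrative:

```python
from dataclasses import dataclass

@dataclass
class PredictedRange:
    lower: float
    upper: float

def is_anomalous(observed_value: float, predicted: PredictedRange) -> bool:
    """Return True when the observation falls outside the predicted range."""
    return not (predicted.lower <= observed_value <= predicted.upper)

# The table is expected to hold roughly 900-1,100 rows, but only 500 arrived.
row_count_range = PredictedRange(lower=900, upper=1_100)
if is_anomalous(observed_value=500, predicted=row_count_range):
    # In Metaplane, this is where a data incident would be created
    # and an alert routed to your configured destinations.
    print("Data incident: row count outside predicted range")
```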
Metaplane monitors can collect these values in many different ways:
- Metadata provided by the data warehouse: table-level details like total rows, when the table was last updated, or the number of columns
- Querying a table directly: Metaplane uses typical aggregate functions like SUM(), AVG(), or COUNT(), or pairs them with a GROUP BY expression to calculate a value per group in a column (see the sketch after this list)
- Values supplied to Metaplane via an API call to the Ingest Datapoint endpoint: numeric values from systems not currently integrated with Metaplane (recommended for advanced users only)
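To illustrate the second collection method, the sketch below uses an in-memory SQLite table as a stand-in for a warehouse table. The table, columns, and queries are hypothetical; the exact SQL Metaplane issues depends on your warehouse and monitor configuration.

```python
import sqlite3

# Stand-in warehouse table with a few illustrative rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'us-east', 120.0),
        (2, 'us-east', 80.0),
        (3, 'eu-west', 45.0);
    """
)

# A table-level monitor collects a single aggregate value per run...
(total_rows,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()

# ...while a Group By monitor collects one value per group in a column.
rows_per_region = conn.execute(
    "SELECT region, COUNT(*) AS row_count, AVG(amount) AS avg_amount "
    "FROM orders GROUP BY region"
).fetchall()

print(total_rows)       # 3
print(rows_per_region)  # one (region, row_count, avg_amount) tuple per region
```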
When initially configured, the monitors will collect values and begin to build the predicted range automatically. Every table behaves differently, so Metaplane uses proprietary machine learning models to project the predicted range for normal vs. anomalous behavior. These models learn from the observations collected, the type of aggregation, seasonal patterns (intraday, day-over-day, week-over-week, month-over-month, or even year-over-year), and more. Most monitors finish their training period and become eligible to send an alert within 5 days of hourly observations, and they continue to refine their ranges as more observations are made. With Metaplane, you can be confident that your monitors are predicting behavior that is specific to your data.
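As a rough intuition for what a predicted range looks like (Metaplane’s actual models are proprietary and account for seasonality, aggregation type, and more), the deliberately simplified sketch below derives a range from past observations:

```python
from statistics import mean, stdev

def predicted_range(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Simplified stand-in for a learned range: mean +/- k standard deviations."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

# Hourly row-count observations collected during a training period.
observations = [1_000, 1_020, 980, 1_050, 995, 1_010, 1_005, 990]
low, high = predicted_range(observations)

new_observation = 500
if not (low <= new_observation <= high):
    print(f"Anomaly: {new_observation} is outside the predicted range ({low:.0f}, {high:.0f})")
```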
The benefits of machine learning-based monitors
Most data teams today care about data quality and have some form of testing, whether querying key tables for status or using a tool like dbt. At Metaplane, we see monitors as distinct from existing tests in these four main areas.
Adaptability
The primary function of a monitor is to alert you when a collected value falls outside of the predicted range, for example, when a table updates hours later than its usual schedule. Metaplane’s machine learning adapts the range according to context. For example, a large ingestion of data on a Sunday may be anomalous, whereas the same volume on a Monday would be normal given recurring business-day activity. Similarly, in a healthy business the sum of monthly recurring revenue might jump at the end of the month and end of the quarter, but a spike outside of those times may be symptomatic of a data quality issue. The monitors’ adaptive predicted ranges save you from needing to develop a deep understanding of the patterns underlying each table and data point.
Scale
Monitors allow your team to automate health checks across a sprawling data architecture. Instead of alerting whenever anything succeeds, fails, or enters a warning region, Metaplane reduces the noise so that you can concentrate incident resolution efforts on data in a problem state. Group By monitors extend this further by building unique predicted ranges for different dimensions within the data, detecting problems that may hide in the aggregate.
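Here is a hypothetical illustration of why per-group ranges matter: a drop in one dimension value can vanish in the overall total but stands out when each group has its own range. The groups, counts, and ranges below are invented.

```python
# Each dimension value gets its own predicted range for daily row counts.
predicted_ranges = {
    "us-east": (900, 1_100),
    "eu-west": (400, 500),
    "ap-south": (150, 250),
}

todays_row_counts = {"us-east": 1_090, "eu-west": 150, "ap-south": 240}

for group, (low, high) in predicted_ranges.items():
    value = todays_row_counts.get(group, 0)
    if not (low <= value <= high):
        print(f"Incident: {group} row count {value} is outside ({low}, {high})")

# Only eu-west fires. The overall total (1,480 rows) could easily sit inside a
# table-level range, hiding the problem in aggregate.
```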
Proactivity
Data engineering is frequently subject to “executive-driven testing”: a leader in your company messages the team to check whether some data is accurate, triggering an investigation. Monitors alert you to possible issues before your end users discover any discrepancies, allowing you to proactively communicate with stakeholders or fix the issue altogether.
Root cause investigations
Multiple monitors can be deployed on a single table to help isolate the root cause of an issue. When you do have a data incident, the root cause could be any number of things: perhaps an orchestration job managing ingestion didn’t run properly, a value was entered incorrectly by hand, or a table received data with missing values. By deploying monitors across distinct tables and noting where alerts fire, you can quickly track down the origin of an issue, whether it lies in a specific part of a table or in an upstream table altogether.
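As a hypothetical sketch of how the firing pattern narrows down a root cause, consider three monitors on the same table (the monitor names, ranges, and values below are invented):

```python
# Predicted ranges for three monitors watching the same table.
monitors = {
    "freshness_hours_since_update": (0, 2),
    "row_count": (9_000, 11_000),
    "customer_id_null_rate_pct": (0.0, 0.5),
}

# Latest observed values.
latest = {
    "freshness_hours_since_update": 1,   # the table updated on schedule
    "row_count": 10_200,                 # the expected volume arrived
    "customer_id_null_rate_pct": 8.4,    # but a key column is suddenly sparse
}

fired = [
    name for name, (low, high) in monitors.items()
    if not (low <= latest[name] <= high)
]
print(fired)  # ['customer_id_null_rate_pct']

# Freshness and volume look healthy, so a failed ingestion job is unlikely;
# the nullness alert points toward a change in the upstream source data instead.
```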