How we built our own dbt alerting infrastructure
TL;DR: We built a new standalone dbt alerting tool that fills the gaps between dbt’s native alerting and Metaplane’s monitoring capabilities. It gives you more context so you can assess the severity of each failed job faster, better routing so you can notify the right people, and an easier setup process so you can get started right away. And it’s free, so any data team can level up their dbt infrastructure.
We recently migrated to a new billing platform and started ingesting all of its data into Snowflake. We added basic test coverage to keep an eye on the quality of the data coming in from our billing provider. Since this was a newer system, we didn’t want to block the pipeline from running, so we set our tests to warn instead of error.
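For a sense of what that looks like, here’s a minimal sketch of a test set to warn; the test and model names are hypothetical, not our actual billing models:

```sql
-- tests/assert_invoice_amounts_non_negative.sql
-- A singular dbt test: any rows returned count as failures.
-- severity='warn' surfaces failures in alerts without failing the run.
{{ config(severity='warn') }}

select *
from {{ ref('stg_billing__invoices') }}
where amount < 0
```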
One day, our #data-incidents channel in Slack started getting alerts from dbt. When I saw these notifications, I did what any high-performing data engineer would have done: ignored them.
Sometimes, the point of failure in a system isn’t the technology, it’s the habits we build around it. In this case, we’d received so many dbt alerts that wound up being nothing. And since dbt alerts didn’t give us the context we needed to differentiate a “this is harmless” alert from a “the system is breaking down” alert, we learned to ignore them, until we started feeling the effects.
Luckily for us, the damage this time was limited to a dashboard, but for other teams millions of dollars could be on the line if an issue isn’t resolved quickly.
Extending dbt alerts and making them useful
We realized something while digging into that dashboard failure: dbt alerts are missing a lot of context.
Without knowing which specific models or tests failed, the full error messages for each failure, or the number of failing records, it’s impossible to know how important each notification really is.
Adding context to make alerts actionable
We built a prototype that parsed the specific failure from the dbt manifest and included it in the error message. Then, we quickly iterated on it because we wanted an easy way to define and check important rules directly in the code.
As we started using the new dbt alerts, we’d forward some failures to teammates who might know more about the issue. The first question we’d get was: “Can I see the data that caused the test to fail?”
We tried using the store_failures property, but we ran into a few UX challenges:
- Test result values would be overwritten in the warehouse on each run, so we’d lose historical context
- We wanted to plan for a future where we can analyze historical test failures. For example, is the same data causing the same failures over time?
- The user would still need to copy and run the query in Snowflake to see the results, and not all of our users work in Snowflake
This last point is important. If we could give enough context so that people wouldn’t have to run a Snowflake query, not only would it make solving the problem easier, but other teams outside of the data org would be able to debug their own issues, too.
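For reference, store_failures is set in the test’s config (it can also be enabled project-wide); here’s a rough sketch, reusing the hypothetical test from earlier:

```sql
-- tests/assert_invoice_amounts_non_negative.sql
-- store_failures=true persists the failing rows as a table in the warehouse,
-- but that table is rebuilt on each run, which is how historical context gets lost.
{{ config(severity='warn', store_failures=true) }}

select *
from {{ ref('stg_billing__invoices') }}
where amount < 0
```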
We ended up writing a macro that unloads failing rows to a Snowflake stage and attaches a link to them in the dbt run results. By making this metadata available, we could then reference it when sending alerts.
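The actual macro is open-sourced (linked at the end of this post), but as a rough sketch of the idea, assuming dbt 1.3+ (for compiled_code) and a hypothetical named stage, an on-run-end hook could unload each failing test’s rows like this:

```sql
-- macros/stage_test_failures.sql
-- Illustrative sketch only, not the open-sourced macro: after a run, unload
-- the failing rows of each warn/fail test to a named Snowflake stage as CSV.
{% macro stage_test_failures(results) %}
  {% if execute %}
    {% for result in results %}
      {% if result.node.resource_type == 'test' and result.status in ('warn', 'fail') %}
        {% set unload_sql %}
          copy into @analytics.dbt_artifacts.test_failures/{{ invocation_id }}/{{ result.node.name }}/
          from ( {{ result.node.compiled_code }} )
          file_format = (type = csv compression = gzip)
          header = true
          overwrite = true
        {% endset %}
        {% do run_query(unload_sql) %}
      {% endif %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```

Wiring it up as `on-run-end: "{{ stage_test_failures(results) }}"` in dbt_project.yml means the failing rows land in the stage after every build, keyed by invocation and test name, so they can be pulled later without re-running the test query.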
You can even download the results as a CSV.
Advanced routing to notify the right people
In our mission to make each dbt alert more meaningful, we also wanted to add better routing capabilities.
Our team uses Metaplane to manage alert routing via an intuitive UI. However, since dbt tests are primarily managed by technical users, we wanted to define alert rules programmatically in dbt YAML. This approach allows us to:
- Avoid alert fatigue by routing the relevant alerts to the right stakeholders.
- Add routing flexibility with filters like tags, warehouse locations (e.g., schemas), or model ownership.
For example, here’s how we route alerts for models tagged as `critical` and located in the `GOLD` schema:
```yaml
exposures:
  - name: metaplane_alerts
    type: application
    owner:
      name: metaplane
    meta:
      metaplane:
        alerting:
          rules:
            - name: critical models
              description: failures related to critical models
              filters:
                - type: TAG
                  tags:
                    - critical
                - type: WAREHOUSE_LOCATION
                  schema: GOLD
              destinations:
                - type: slack_channel
                  name: high_priority
```
This ensures that any failure affecting critical models in the `GOLD` schema is sent to a designated `high_priority` Slack channel.
Set up your own dbt alerts in minutes
We know we’re not the only data team experiencing this problem. So, instead of keeping this solution to ourselves, we decided to give it away as a standalone tool!
Head here to set up your own dbt alerts using our alerting tool (you may want to check out the docs, as well).
And if you just want to start downloading test failure results, we open-sourced our macro so you can use it in your own project.