How we built our own dbt alerting infrastructure
TL;DR: We built a new standalone dbt alerting tool that fills the gaps between dbt’s native alerting and Metaplane’s monitoring capabilities. It gives you more context so you can assess the severity of each failed job faster, better routing so you can notify the right people, and an easier setup process so you can get started right away. And it’s free, so any data team can level up their dbt infrastructure.
We recently migrated to a new billing platform and started ingesting all of its data into Snowflake. We added basic test coverage to keep an eye on the quality of the data coming in from our billing provider. Since this was a newer system, we didn’t want to block the pipeline from running, so we set our tests to warn instead of error.
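For a sense of what that looks like, here’s a minimal sketch of a test set to warn; the test and model names are hypothetical, not our actual billing models:

```sql
-- tests/assert_invoice_amounts_non_negative.sql
-- A singular dbt test: any rows returned count as failures.
-- severity='warn' surfaces failures in alerts without failing the run.
{{ config(severity='warn') }}

select *
from {{ ref('stg_billing__invoices') }}
where amount < 0
```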
One day, our #data-incidents channel in Slack started getting alerts from dbt. When I saw these notifications, I did what any high-performing data engineer would have done: ignored them.
Sometimes, the point of failure in a system isn’t the technology, it’s the habits we build around it. In this case, we’d received so many dbt alerts that wound up being nothing. And since dbt alerts didn’t give us the context we needed to differentiate a “this is harmless” alert from a “the system is breaking down” alert, we learned to ignore them, until we started feeling the effects.
Luckily for us, the damage this time was limited to a dashboard, but for other teams millions of dollars could be on the line if an issue isn’t resolved quickly.
Extending dbt alerts and making them useful
We realized something while digging into that dashboard failure: dbt alerts are missing a lot of context.
Without knowing which specific models or tests failed, the full error messages for each failure, or the number of failing records, it’s impossible to know how important each notification really is.
Adding context to make alerts actionable
We built a prototype that parsed the specific failure from the dbt manifest and included it in the error message. Then, we quickly iterated on it because we wanted an easy way to define and check important rules directly in the code.
As we started using the new dbt alerts, we’d forward some failures to teammates who might know more about the issue. The first question we’d get was: “Can I see the data that caused the test to fail?”
We tried using the store_failures property, but we ran into a few UX challenges:
- Test result values would be overwritten in the warehouse on each run, so we’d lose historical context
- We wanted to plan for a future where we can analyze historical test failures. For example, is the same data causing the same failures over time?
- The user would still need to copy and run the query in Snowflake to see the results, and not all of our users work in Snowflake
This last point is important. If we could give enough context so that people wouldn’t have to run a Snowflake query, not only would it make solving the problem easier, but other teams outside of the data org would be able to debug their own issues, too.
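For reference, store_failures is set in the test’s config (it can also be enabled project-wide); here’s a rough sketch, reusing the hypothetical test from earlier:

```sql
-- tests/assert_invoice_amounts_non_negative.sql
-- store_failures=true persists the failing rows as a table in the warehouse,
-- but that table is rebuilt on each run, which is how historical context gets lost.
{{ config(severity='warn', store_failures=true) }}

select *
from {{ ref('stg_billing__invoices') }}
where amount < 0
```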
We ended up writing a macro that unloads failing rows to a Snowflake stage and attaches a link to them in the dbt run results. By making this metadata available, we could then reference it when sending alerts.
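The actual macro is open-sourced (linked at the end of this post), but as a rough sketch of the idea, assuming dbt 1.3+ (for compiled_code) and a hypothetical named stage, an on-run-end hook could unload each failing test’s rows like this:

```sql
-- macros/stage_test_failures.sql
-- Illustrative sketch only, not the open-sourced macro: after a run, unload
-- the failing rows of each warn/fail test to a named Snowflake stage as CSV.
{% macro stage_test_failures(results) %}
  {% if execute %}
    {% for result in results %}
      {% if result.node.resource_type == 'test' and result.status in ('warn', 'fail') %}
        {% set unload_sql %}
          copy into @analytics.dbt_artifacts.test_failures/{{ invocation_id }}/{{ result.node.name }}/
          from ( {{ result.node.compiled_code }} )
          file_format = (type = csv compression = gzip)
          header = true
          overwrite = true
        {% endset %}
        {% do run_query(unload_sql) %}
      {% endif %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```

Wiring it up as `on-run-end: "{{ stage_test_failures(results) }}"` in dbt_project.yml means the failing rows land in the stage after every build, keyed by invocation and test name, so they can be pulled later without re-running the test query.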
You can even download the results as a CSV.
Advanced routing to notify the right people
In our mission to make each dbt alert more meaningful, we also wanted to add better routing capabilities.
Our team uses Metaplane to manage alert routing via an intuitive UI. However, since dbt tests are primarily managed by technical users, we wanted to define alert rules programmatically in dbt YAML. This approach allows us to:
- Avoid alert fatigue by routing the relevant alerts to the right stakeholders.
- Add routing flexibility with filters like tags, warehouse locations (e.g., schemas), or model ownership.
For example, here’s how we route alerts for models tagged as `critical` and located in the `GOLD` schema:
```yaml
exposures:
  - name: metaplane_alerts
    type: application
    owner:
      name: metaplane
    meta:
      metaplane:
        alerting:
          rules:
            - name: critical models
              description: failures related to critical models
              filters:
                - type: TAG
                  tags:
                    - critical
                - type: WAREHOUSE_LOCATION
                  schema: GOLD
              destinations:
                - type: slack_channel
                  name: high_priority
```
This ensures that any failure affecting critical models in the `GOLD` schema is sent to a designated `high_priority` Slack channel.
Set up your own dbt alerts in minutes
We know we’re not the only data team experiencing this problem. So, instead of keeping this solution to ourselves, we decided to give it away as a standalone tool!
Head here to set up your own dbt alerts using our alerting tool (you may want to check out the docs, as well).
And if you just want to start downloading test failure results, we open-sourced our macro so you can use it in your own project.