How we built our own dbt alerting infrastructure

TL;DR: We built a new standalone dbt alerting tool that fills the gaps between dbt’s native alerting and Metaplane’s monitoring capabilities. It gives you more context so you can assess the severity of each failed job faster, better routing so you can notify the right people, and an easier setup process so you can get started right away. And it’s a free standalone tool, so any data team can level up their dbt infrastructure.

December 4, 2024

Co-founder / Engineering

Founding Engineer


We recently migrated to a new billing platform and started ingesting all of the data from it into Snowflake. We added basic test coverage to keep an eye on the quality of the data coming in from our billing provider. Since this was a newer system, we didn't want to block the pipeline from running, so we set our tests to warn.
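
For context, warning-level tests are just a severity setting on the test itself. Here's a minimal sketch of that kind of config (the model and column names are illustrative, not our actual billing models):

version: 2

models:
  - name: stg_billing__invoices
    columns:
      - name: invoice_id
        tests:
          - not_null:
              config:
                severity: warn  # report failures without failing the run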

One day, our #data-incidents channel in Slack started getting alerts from dbt. When I saw these notifications, I did what any high-performing data engineer would have done: ignored them.

Sometimes, the point of failure in a system isn't the technology; it's the habits we build around it. In this case, we'd received so many dbt alerts that wound up being nothing. And since dbt alerts didn't give us the context we needed to differentiate a "this is harmless" alert from a "the system is breaking down" alert, we learned to ignore them, at least until we started feeling the effects.

Luckily for us, the fallout was only a broken dashboard, but for other teams, millions of dollars could be on the line if an issue isn't resolved quickly.

Extending dbt alerts and making them useful

We realized something while digging into that dashboard failure: dbt alerts are missing a lot of context.

Without knowing which specific models or tests failed, the full error messages for each failure, or the number of failing records, it’s impossible to know how important each notification really is.

Adding context to make alerts actionable

We built a prototype that parsed the specific failure from the dbt manifest and included it in the error message. Then, we quickly iterated on it because we wanted an easy way to define and check important rules directly in the code.

Here's what that version looked like:
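
To give a rough sense of the idea (a hedged sketch, not our actual prototype), dbt's on-run-end hook exposes a results context that already carries the failing node, its full message, and the number of failing records:

{# macros/summarize_failures.sql (illustrative sketch only) #}
{% macro summarize_failures(results) %}
  {% if execute %}
    {% for result in results %}
      {% if result.status in ['error', 'fail', 'warn'] %}
        {# Log the node, its status, the failing-record count, and the full message #}
        {{ log(
          result.node.resource_type ~ ' ' ~ result.node.name
          ~ ' finished with status ' ~ result.status
          ~ ' (' ~ (result.failures or 0) ~ ' failing records): '
          ~ result.message,
          info=true
        ) }}
      {% endif %}
    {% endfor %}
  {% endif %}
{% endmacro %}

Wired up in dbt_project.yml:

on-run-end:
  - "{{ summarize_failures(results) }}"

From there, it's a short step to formatting that same context for Slack instead of the dbt log.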

As we started using the new dbt alerts, we'd forward some failures to teammates who might know more about the issue. The first question we'd get was: "Can I see the data that caused the test to fail?"

We tried using the store_failures property, but we ran into a few UX challenges:

  • Test result values would be overwritten in the warehouse, so we'd lose historical context
  • We wanted to plan for a future where we could analyze historical test failures. For example, is the same data causing the same failures over time?
  • The user would still need to copy and run the query in Snowflake to see the results, and not all users use Snowflake

This last point is important. If we could give enough context so that people wouldn’t have to run a Snowflake query, not only would it make solving the problem easier, but other teams outside of the data org would be able to debug their own issues, too.

We ended up writing a macro that downloads failing rows to a Snowflake stage and attaches the link to dbt run results. By making this metadata available, we could then reference it when sending alerts.
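
Our open-sourced macro handles the details, but the core move is a COPY INTO a named stage. Here's a hedged sketch of that idea, where the stage name, path, and macro name are assumptions rather than the actual implementation:

{# macros/upload_failing_rows.sql (illustrative sketch only) #}
{% macro upload_failing_rows(failures_relation, test_name) %}
  {% if execute %}
    {% set path = test_name ~ '/' ~ run_started_at.strftime('%Y%m%d_%H%M%S') %}
    {% set copy_sql %}
      copy into @analytics.public.test_failures_stage/{{ path }}/results.csv
      from (select * from {{ failures_relation }})
      file_format = (type = csv compression = none)
      header = true
      overwrite = true
      single = true
    {% endset %}
    {% do run_query(copy_sql) %}
    {# Surface the stage path so it can be referenced when sending alerts #}
    {{ log('Failing rows for ' ~ test_name ~ ' uploaded to stage path ' ~ path, info=true) }}
  {% endif %}
{% endmacro %}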

Now our alerts look like this. Note the "view results" link:

You can even download the results as a CSV:

Advanced routing to notify the right people

In our mission to make each dbt alert more meaningful, we also wanted to add better routing capabilities. 

Our team uses Metaplane to manage alert routing via an intuitive UI. However, since dbt tests are primarily managed by technical users, we wanted to define alert rules programmatically in dbt YAML. This approach allows us to:

  • Avoid alert fatigue by routing the relevant alerts to the right stakeholders.
  • Add routing flexibility with filters like tags, warehouse locations (e.g., schemas), or model ownership.

For example, here’s how we route alerts for models tagged as critical and located in the GOLD schema:

exposures:
  - name: metaplane_alerts
    type: application
    owner:
      name: metaplane
    meta:
      metaplane:
        alerting:
          rules:
            - name: critical models
              description: failures related to critical models
              filters:
                - type: TAG
                  tags:
                    - critical
                - type: WAREHOUSE_LOCATION
                  schema: GOLD
              destinations:
                - type: slack_channel
                  name: high_priority

This ensures that any failure affecting critical models in the GOLD schema is sent to a designated high_priority Slack channel.
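
The critical tag itself is just standard dbt config on the models you care most about, for example (model name illustrative):

models:
  - name: fct_revenue
    config:
      tags: ['critical']

You can also set it directly in the model file with {{ config(tags=['critical']) }}.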

Set up your own dbt alerts in minutes

We know we’re not the only data team experiencing this problem. So, instead of keeping this solution to ourselves, we decided to give it away as a standalone tool!

Head here to set up your own dbt alerts using our alerting tool (you may want to check out the docs, as well).

And if you just want to start downloading test failure results, we open-sourced our macro so you can use it in your own project. 
