January 22, 2025
What is dbt-expectations and what should you use it for?

Data quality issues have a habit of sneaking into production when you least expect them. A seemingly innocent upstream schema change or a "minor" modification to an ETL process can cascade into data anomalies that impact business decisions.

This is where dbt-expectations can help. A pre-built dbt test package, dbt-expectations lets you catch data quality issues before they make it into production.

Follow along to learn more about dbt-expectations, how to use it, and see some examples of it in action.

Note: As of 12/18/2024, the original dbt-expectations package is no longer actively maintained. We are taking over active development via a fork here: https://github.com/metaplane/dbt-expectations

What is dbt-expectations?

dbt-expectations is a dbt package that brings the power of Great Expectations-style assertions to your dbt workflows. It provides a comprehensive suite of tests that can be implemented directly in your dbt models, allowing you to catch data quality issues before they propagate through your data pipeline.
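
If you want to try it out, it installs like any other dbt package: add it to your `packages.yml` and run `dbt deps`. Here's a minimal sketch that pulls the actively maintained fork mentioned above; pinning to `main` is just for illustration, and in practice you'd pin a released tag or commit:

```yaml
# packages.yml -- minimal sketch; pin a specific tag or commit in practice
packages:
  - git: "https://github.com/metaplane/dbt-expectations.git"
    revision: main  # illustrative; replace with a specific release
```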

When should you use dbt-expectations?

While dbt's built-in tests (`unique`, `not_null`, `accepted_values`, `relationships`) are great for basic data validation, dbt-expectations shines in several specific scenarios:

1. Complex data validation requirements

dbt-expectations is incredibly helpful when working with complex data structures that go beyond simple column checks, such as validating semi-structured data like JSON or ensuring that nested data follows a specific schema.

It's also perfect for checking complex string patterns, like validating that your custom IDs follow company conventions or that email addresses meet specific formatting rules. When you need to validate numerical ranges with special business logic exceptions, dbt-expectations gives you more flexibility than basic tests.

2. Time series data quality

Data freshness and completeness are critical when dealing with time series data. dbt-expectations helps you monitor these with specialized tests that make sure your data meets temporal requirements.

You can validate that you're receiving data within expected timeframes, check for completeness across different time periods, and even detect seasonal patterns or anomalies that might indicate underlying issues in your data pipeline.

3. Statistical validation

Data quality issues aren't always as simple as null values or wrong formats. dbt-expectations includes statistical validation tools that help you catch subtle problems in your data.

You can detect outliers using statistical methods, validate that your data follows expected distribution patterns, and identify suspicious changes in aggregated metrics that might indicate upstream issues. These tests are particularly valuable when you're dealing with large datasets where manual inspection isn't realistic.

4. Cross-column validation

Data models often involve complex relationships between columns, and dbt-expectations helps you validate these dependencies. You can make sure logical relationships between multiple columns hold true, validate complex business rules that span different fields, and check compound uniqueness constraints.

This is particularly important when you're working with normalized data models where you need to maintain business logic across multiple columns or tables.

Common dbt-expectations tests

Now that we have an idea of what dbt-expectations can do, let's walk through some of the specific tests and when you might want to use them.

1. Basic value validations

```yaml
- dbt_expectations.expect_column_values_to_not_be_null
- dbt_expectations.expect_column_values_to_be_between
- dbt_expectations.expect_column_values_to_be_in_set
```

Perfect when you need to ensure data quality basics. For example, when validating an orders table, you might check that `order_amount` is never negative, `shipping_addresses` aren't null for physical products, or that status values are always from your allowed list of ['pending', 'shipped', 'delivered', 'cancelled'].
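
As a sketch, here's how those checks might look in a schema file (column names like `order_amount`, `shipping_address`, and `product_type` are illustrative):

```yaml
models:
  - name: orders
    columns:
      - name: order_amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0  # order amounts should never be negative
      - name: shipping_address
        tests:
          - dbt_expectations.expect_column_values_to_not_be_null:
              row_condition: "product_type = 'physical'"  # digital orders may have no address
      - name: status
        tests:
          - dbt_expectations.expect_column_values_to_be_in_set:
              value_set: ['pending', 'shipped', 'delivered', 'cancelled']
```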

2. Row count tests

```yaml
- dbt_expectations.expect_table_row_count_to_be_between
- dbt_expectations.expect_table_row_count_to_equal_other_table
```

Great for catching pipeline issues early. You might use these to verify that your daily orders table has at least 100 rows (based on historical minimums), or to ensure your `fact_sales` table row count matches your `staging_sales` table to catch any dropped records during transformation.
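
For example, a model-level configuration along these lines would cover both checks (the threshold and model names are illustrative):

```yaml
models:
  - name: fact_sales
    tests:
      # Fail if the table looks suspiciously small
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 100
      # Catch records dropped during transformation
      - dbt_expectations.expect_table_row_count_to_equal_other_table:
          compare_model: ref("staging_sales")
```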

3. Relationship tests

```yaml
- dbt_expectations.expect_column_pair_values_to_be_in_set
- dbt_expectations.expect_compound_columns_to_be_unique
```

Essential for validating complex business rules. For instance, checking that `return_date` only exists when `order_status` is 'returned', or ensuring that the combination of `order_id` and `product_id` is unique in your `order_items` table.
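
For the compound uniqueness case, a sketch might look like this (using the model and column names from the example above):

```yaml
models:
  - name: order_items
    tests:
      # No order should contain the same product twice
      - dbt_expectations.expect_compound_columns_to_be_unique:
          column_list: ["order_id", "product_id"]
```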

4. Time-based tests

```yaml
- dbt_expectations.expect_row_values_to_have_recent_data
- dbt_expectations.expect_grouped_row_values_to_have_recent_data
```

Helpful for monitoring data freshness. Use these to verify that your sales data has today's transactions, or that each store in your `retail_locations` table has reported data in the last 24 hours.
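
A rough sketch of both patterns, assuming an illustrative `created_at` timestamp on sales and a `reported_at` timestamp per store:

```yaml
models:
  - name: fct_sales
    columns:
      - name: created_at
        tests:
          # Expect at least one row from the last day
          - dbt_expectations.expect_row_values_to_have_recent_data:
              datepart: day
              interval: 1
  - name: retail_locations
    tests:
      # Every store should have reported within the last 24 hours
      - dbt_expectations.expect_grouped_row_values_to_have_recent_data:
          group_by: [store_id]
          timestamp_column: reported_at
          datepart: hour
          interval: 24
```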

5. String pattern tests

```yaml
- dbt_expectations.expect_column_values_to_match_regex
- dbt_expectations.expect_column_values_to_match_like_pattern
```

Perfect for standardizing formatted fields. You might use these to ensure all email addresses match a valid pattern, phone numbers follow your standard format, or that product SKUs always match your company's pattern (e.g., 'PRD-[A-Z]{2}-[0-9]{6}').
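
Using the SKU pattern from above as an example, the regex test might be configured like this (model and column names are illustrative):

```yaml
models:
  - name: products
    columns:
      - name: sku
        tests:
          # SKUs must follow the company convention, e.g. PRD-AB-123456
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: 'PRD-[A-Z]{2}-[0-9]{6}'
```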

6. Statistical tests

```yaml
- dbt_expectations.expect_column_values_to_be_within_n_stdevs
- dbt_expectations.expect_column_mean_to_be_between
```

Excellent for catching anomalies. You could use these to detect unusual spikes in order values, verify that daily user signup rates remain within historical norms, or ensure average cart values by region don't suddenly deviate from expected ranges.
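
As a sketch, a grouped mean check on average cart values might look like this (the thresholds and the `region` grouping are illustrative):

```yaml
models:
  - name: daily_cart_metrics
    columns:
      - name: cart_value
        tests:
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 20
              max_value: 200
              group_by: [region]  # check each region's average separately
```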

7. JSON validation tests

```yaml
- dbt_expectations.expect_column_values_to_be_json
- dbt_expectations.expect_column_values_to_have_json_schema
```

Essential when working with nested data. For example, validating that your `event_payload` JSON column always contains required fields like `user_id` and `timestamp`, or ensuring that your `product_metadata` JSON follows your expected schema.

8. Aggregate tests

```yaml
- dbt_expectations.expect_column_sum_to_be_between
- dbt_expectations.expect_column_distinct_count_to_be_between
```

Perfect for validating business metrics. You might use these to verify that daily total revenue falls within expected ranges, or that the number of unique customers per month doesn't unexpectedly drop or spike.
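
For instance, a sketch of both tests on a hypothetical daily revenue model (thresholds are illustrative):

```yaml
models:
  - name: daily_revenue
    columns:
      - name: revenue
        tests:
          # Daily totals should stay within the expected range
          - dbt_expectations.expect_column_sum_to_be_between:
              min_value: 10000
              max_value: 500000
              group_by: [date_day]
      - name: customer_id
        tests:
          # The number of unique customers shouldn't collapse unexpectedly
          - dbt_expectations.expect_column_distinct_count_to_be_between:
              min_value: 500
```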

A real-world dbt-expectations test example

As you can tell, there are loads of different dbt-expectations tests, and they aren't only meant to be deployed individually. Let's walk through an example where you would use multiple test types at once.

Say you work on the data team at an e-commerce company, and your customer service team is getting flooded with complaints about order issues that could have been caught earlier. Your team noticed a pattern of issues: orders with missing IDs causing failed shipment notifications, negative order amounts breaking the billing system, and marketing emails bouncing because of invalid email addresses.

```yaml
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - dbt_expectations.expect_column_values_to_not_be_null
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: '^ORD-[0-9]{8}$'  # Ensures order IDs follow pattern ORD-20240117
      - name: amount
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 10000
              row_condition: "status != 'cancelled'"
      - name: customer_email
        tests:
          - dbt_expectations.expect_column_values_to_match_regex:
              regex: '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
              row_condition: "requires_email = true"
    tests:
      - dbt_expectations.expect_row_values_to_have_recent_data:
          datepart: day
          interval: 1
          column_name: created_at
```

Let’s break down what’s happening here. 

In this example, we combined tests from dbt-expectations to validate an `orders` table. These tests check that order IDs aren't null and follow a specific format (ORD- followed by 8 digits), that order amounts are between $0 and $10,000 (but only for non-cancelled orders), that email addresses match a proper format (but only when an email is required), and that you have fresh data from the last day.

These tests work together to catch common issues like malformed IDs, suspicious order amounts, invalid emails, and data freshness problems before they impact your downstream systems or analysis.

Tips for implementing dbt-expectations

Now that you have an idea of what kind of tests are in dbt-expectations, here are some tips for how to best implement them.

1. Don't test more than you need

Starting with business-critical data points is the most effective approach when implementing dbt-expectations. Financial metrics that directly impact business reporting and decision-making are often the best place to begin.

Consider this validation for a core financial metrics model:

```yaml
models:
  - name: core_financial_metrics
    columns:
      - name: gross_margin
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: -1
              max_value: 1
              row_condition: "is_valid_transaction = true"
              config:
                severity: error
                description: "Gross margin should always be between -100% and 100%"
```

This test can catch critical issues like mismatched units between revenue and cost calculations. Starting with these essential metrics helps build confidence in the testing approach before expanding to other data points.

2. Use severity levels strategically

Not all data quality issues should break a pipeline. Different severity levels help balance data quality with pipeline reliability. Here's an approach for high-value orders:

```yaml
tests:
  - dbt_expectations.expect_column_values_to_be_between:
      min_value: 0
      max_value: 1000000
      config:
        severity: warn  # Don't fail the pipeline, but alert the team
        description: >
          Orders over $1M require special approval.
          Values outside this range should be investigated but might be legitimate.
```

By being thoughtful about using severity levels, you can allow legitimate edge cases through the pipeline, create better alerts, and prevent unnecessary pipeline failures.

3. Document your assumptions

Documentation in test configurations can save hours of investigation time, especially for statistical tests where thresholds might not be immediately obvious:

```yaml
models:
  - name: customer_metrics
    columns:
      - name: lifetime_value
        tests:
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 100
              max_value: 5000
              group_by: ['customer_segment']
              config:
                description: >
                  Based on historical analysis, segment LTV means should fall within this range.
                  Significant deviations could indicate:
                  1. Data quality issues in order attribution
                  2. Problems with segment classification
                  3. Major changes in customer behavior requiring investigation
```

This documentation helps teams understand the reasoning behind specific thresholds, the business logic informing the test, what to investigate in case of failure, and who to contact for specific failure types.

4. Use environment-specific configurations

Different environments often require different validation rules. Development environments might have minimal test data, while production requires comprehensive coverage. Here's how to handle this using dbt's target configurations:

```yaml
tests:
  - dbt_expectations.expect_table_row_count_to_be_between:
      min_value: "{{ 1000 if target.name == 'prod' else 1 }}"
      config:
        enabled: "{{ target.name in ['prod', 'qa'] }}"
```

This approach creates a flexible testing framework that adapts to different environments while maintaining data quality standards. You get strict validation in production where it matters most, while developers retain the flexibility they need in development environments. It also prevents the frustration of false alarms during development and testing phases, where data completeness isn't the primary concern.

Wrap up

dbt-expectations is an incredibly handy addition to any data quality arsenal. While it requires some upfront investment in setting up meaningful tests, the time saved in catching issues early and maintaining data quality far outweighs the initial setup cost.

Remember to start small—perhaps with basic null and range checks—and gradually build up your test suite as you identify common failure patterns in your data. Your future self (and your stakeholders) will thank you.
