Three ways to track schema drift in Snowflake
Schema drift is inevitable, but that doesn't mean you can ignore it. Before you know it, you end up with missing or inconsistent data that undermines the accuracy of your Snowflake queries and reports. And without knowing when or where these issues occur, you can't even begin to address them effectively. Here are three ways to track schema drift in Snowflake so you can stay on top of any changes.
To a data engineer, there are few things scarier than schema drift.
A horror movie marathon at midnight? Child’s play.
Finding a spider in your bed? No sweat.
Opening your fridge to find it completely empty? Okay, this might be a close second.
Still, the true chill down a data engineer's spine is reserved for the unpredictable and often unwelcome surprises lurking within the depths of their data warehouse, courtesy of schema drift.
And the scariest part? Schema drift is inevitable.
What causes schema drift?
Changes in database schema over time—whether that’s additions, deletions, or modifications of columns, tables, or data types—lead to schema drift. These changes can be planned or unplanned, gradual or unexpected.
These are some of the common causes of schema drift:
- Data gets corrupted during migration
- Data warehouse updates such as:
  - Adding features or fixing issues
  - Establishing new relationships between tables
  - Removing existing relationships between tables when they become unnecessary or irrelevant
- Your organization switches to a different data warehouse
- Your organization’s business requirements change and you need to:
  - Add new fields for new types of data to be collected and stored
  - Remove fields if certain types of data are no longer needed
  - Modify the data type of a field to correctly reflect the nature of the data being stored
- You introduce new data sources
- Technology standards change and/or new regulations are introduced
With all those causes, it’s no wonder that schema drift leads to so many data pipeline outages. You've got columns being added and data types changing on the fly, and when these changes go unnoticed, they swiftly erode data quality.
This results in missing or inconsistent data that not only compromises the integrity and reliability of your queries and reports but also diminishes the overall trust in data across the organization. To mitigate this, you have to track any and all possible schema changes. Here are three ways to track schema drift in Snowflake.
Option 1: Generate and compare schema snapshots
Snowflake and other cloud warehouses’ support for native Schema Change Tracking is still in its infancy, but that doesn’t preclude users from creating their own history of changes.
One way that you can do this is through periodic, recurring snapshots of your schemas. Here’s a simple sample query that you could run to snapshot the schema of the tables within a given database:
USE DATABASE <your_database_name>;

-- Capture a point-in-time snapshot of every column in the database
CREATE OR REPLACE TABLE snapshot_t0 AS
SELECT
    table_schema,
    table_name,
    column_name,
    data_type
FROM information_schema.columns
ORDER BY 1, 2, 3, 4;
After you create that snapshot, you’ll want to compare for deltas. Assuming we’re sticking with plain SQL, you’ll then write a few queries to check for common schema changes. Below are sample queries, with snapshot_t0 and snapshot_t1 being placeholder names for your snapshot tables (the second created by rerunning the statement above at a later point in time).
-- Build a comparison view so each check below can reference the same join;
-- the aliases keep the columns from the two snapshots distinct.
CREATE OR REPLACE VIEW schema_comparison AS
SELECT
    t0.table_name  AS table_name_0,
    t1.table_name  AS table_name_1,
    t0.column_name AS column_name_0,
    t1.column_name AS column_name_1,
    t0.data_type   AS data_type_0,
    t1.data_type   AS data_type_1
FROM snapshot_t0 t0
FULL OUTER JOIN snapshot_t1 t1
    ON  t0.table_schema = t1.table_schema
    AND t0.table_name   = t1.table_name
    AND t0.column_name  = t1.column_name;

-- Finding new, dropped, or renamed columns
SELECT
    column_name_0,
    column_name_1
FROM schema_comparison
WHERE column_name_0 IS DISTINCT FROM column_name_1;

-- Finding new, dropped, or renamed tables
SELECT DISTINCT
    table_name_0,
    table_name_1
FROM schema_comparison
WHERE table_name_0 IS DISTINCT FROM table_name_1;

-- Finding changed data types on an existing column
SELECT
    table_name_0,
    column_name_0,
    data_type_0,
    data_type_1
FROM schema_comparison
WHERE column_name_0 = column_name_1
  AND data_type_0 != data_type_1;
There are a few gaps in this approach, the most significant being:
- The need to specify which database(s) and schema(s) you’d like to run this for
- Orchestration to schedule both the snapshots and the delta queries (a sketch of one approach follows this list)
- An understanding of whether a given schema change was actually significant
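To give a sense of what that orchestration could look like, here’s a minimal sketch using a Snowflake task. The table name, task name, warehouse, and schedule are all placeholders, and a real setup would also scope the snapshot to specific databases and schemas and schedule the delta queries in the same way.

-- Running history table for snapshots (illustrative name)
CREATE TABLE IF NOT EXISTS schema_snapshots (
    captured_at   TIMESTAMP_LTZ,
    table_schema  VARCHAR,
    table_name    VARCHAR,
    column_name   VARCHAR,
    data_type     VARCHAR
);

-- Task that appends a fresh snapshot every day at 06:00 UTC
CREATE OR REPLACE TASK snapshot_schema_daily
    WAREHOUSE = <your_warehouse_name>
    SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
    INSERT INTO schema_snapshots
    SELECT CURRENT_TIMESTAMP(), table_schema, table_name, column_name, data_type
    FROM information_schema.columns;

-- Tasks are created suspended, so resume the task to start the schedule
ALTER TASK snapshot_schema_daily RESUME;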
Option 2: Snowflake Community Python script
Inspired by the Flyway database migration tool, the Snowflake Community Python script (schemachange) is a simple Python-based tool for managing all of your Snowflake objects. You can read all about the open-source script in the schemachange repository, but here’s an overview:
- You or someone on your team should have the ability to run Python. Note that you’ll need the Snowflake Python driver installed wherever you’ll be running this script. You’ll also want to familiarize yourselves with Jinja templating if you want to simplify inserting variables as you find yourself with new tables.
- You’ll need to create a table in Snowflake to write changes to, with the default location being METADATA.SCHEMACHANGE.CHANGE_HISTORY
- You’ll need to specify your Snowflake connection parameters in schemachange-config.yml
- You’ll need to write the queries that define your desired schema changes, following the tool’s naming conventions and folder structure
This script helps you manually track all of your schema changes. But you’ll also need to explicitly define what tables, schemas, and databases you're tracking—and be able to run Python and be familiar with the CLI to do so.
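For a rough sense of what a change script looks like: schemachange picks up SQL files whose names encode a version, applies them in order, and records each run in the change history table. The file name, table, and column below are purely illustrative; check the schemachange README for the exact naming rules.

-- V1.1.1__add_customer_email.sql (hypothetical versioned change script;
-- schemachange infers ordering from the version prefix in the file name)
ALTER TABLE analytics.public.customers
    ADD COLUMN email VARCHAR;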
Option 3: Leverage Snowflake’s information_schema
Similar to Option 2, you can leverage Snowflake’s information_schema to get a view of recent schema changes. This solution can be helpful for ad-hoc checks when triaging a data quality incident. But keep in mind that, by default, Snowflake only retains this query history for 7 days.
An example query listing recent schema changes would look like this:
SELECT
    database_name,
    schema_name,
    query_text
FROM TABLE(information_schema.query_history())
WHERE
    -- specify the schema change statement(s) you care about here
    query_text LIKE 'ALTER TABLE%'
    -- optional filters on database, schema, role, user, or compute warehouse;
    -- alternatively, you can use the information_schema.query_history_by_* table functions
    AND database_name = '<your_database_name>'
    AND schema_name = '<your_schema_name>'
    AND role_name = '<your_role_name>'
    AND user_name = '<your_user_name>'
    AND warehouse_name = '<your_warehouse_name>';
This is a great option for triaging incidents that occurred within the past week. You can optionally save the results for a full history of schema changes to reference, but it can quickly become compute-heavy for Snowflake instances with a high volume of queries.
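If you do want to persist these results, a minimal sketch might look like the following. The log table name is illustrative, and you’d run the INSERT on a schedule (for example, with a task like the one sketched in Option 1).

-- Illustrative log table for schema-change statements
CREATE TABLE IF NOT EXISTS schema_change_log (
    database_name  VARCHAR,
    schema_name    VARCHAR,
    query_text     VARCHAR,
    start_time     TIMESTAMP_LTZ
);

-- Append any ALTER TABLE statements not yet logged
INSERT INTO schema_change_log
SELECT
    database_name,
    schema_name,
    query_text,
    start_time
FROM TABLE(information_schema.query_history(RESULT_LIMIT => 10000))
WHERE query_text LIKE 'ALTER TABLE%'
  AND start_time > (
      SELECT COALESCE(MAX(start_time), '1970-01-01'::TIMESTAMP_LTZ)
      FROM schema_change_log
  );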
So, if none of these options quite fit the bill and you’re looking for an automated way to track your schema changes, we have another option.
Metaplane automatically monitors your data warehouse for schema changes (e.g. column additions, data type changes, table drops, etc). There’s no need to define what tables, schemas, and databases you're tracking.
The best part? With Metaplane, you can filter out which schema changes you want to monitor and receive notifications about. That way, you can only receive the alerts that are critical to your data's integrity and operational continuity—not just a barrage of noise.
Whether you opt for a manual or an automated approach, the bottom line is this: start tracking your schema changes if you aren’t already. So, choose a method and stick with it! That’s the only way to prevent any more schema drift-related pain, increase the reliability of your data systems, and boost trust across the organization in the process.
Want to get automatically alerted on schema changes? Talk to us or start a free trial today.