
Schema Changes

Whether your team is currently using one database for both transactions and analytics, or is centralizing several databases’ worth of data into your analytics lakehouse, it’s important to track schema changes, with distinct benefits for different types of changes.

  • Changed data types: Ensure that existing queries, such as those applicable to numeric data types only, can still function.
  • Deprecated fields and objects: Align data strategy by deprecating dashboards that still use those objects, or by determining a replacement.
  • Renamed fields and objects: Update queries with new field and object references.
  • New fields and objects: Determine the value of new fields and objects to current or upcoming data projects.

Transactional Database Schema Change Monitoring

Transactional databases are used to store data for software application events, website activity, manufacturing progress, and more. As a result, in a high-growth company using data to drive the business forward, it’s common to see fields change or new fields appear altogether as teams start tracking different aspects of their activities.

Tracking schema changes can also improve communication between different groups and help them align on activities. For example, in the case of a software application where the Product team dictates changes that an Engineering team implements, identifying the list of schema changes can lead to better cohesion with the downstream Product Analytics team.

Examples of tracking schema changes

One method that works for most databases is to create a process (i.e. an event trigger) that writes DDL statements to a table that you create.
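
Each of the trigger-based examples below writes into a log table that you create up front. Here’s a minimal sketch of that table (the column names are illustrative, and the exact column set differs slightly per database, so match it to whatever your trigger inserts):

```sql
-- A minimal log table for captured DDL events; adjust types per database
-- (e.g. NVARCHAR/DATETIME2 on SQL Server). Column names are illustrative.
CREATE TABLE schema_change_log_table (
    log_time        TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    schema_name     VARCHAR(255),
    object_name     VARCHAR(255),
    object_identity VARCHAR(255),
    operation       VARCHAR(100),
    ddl_statement   TEXT
);
```

With that table in place, here are examples of how you’d set up the trigger for different databases: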

SQL Server

Note: You can optionally enable change data capture instead, which captures both DML and DDL statements along with additional metadata. This will likely use slightly more storage and compute than creating your own event trigger.

```sql
CREATE TRIGGER schema_change_trigger
ON DATABASE
FOR DDL_DATABASE_LEVEL_EVENTS
AS
BEGIN
  SET NOCOUNT ON;

  -- EVENTDATA() returns XML describing the DDL event that fired the trigger
  DECLARE @eventdata XML;
  SET @eventdata = EVENTDATA();

  -- Write the object name, event type, and raw DDL statement to the log table
  INSERT INTO schema_change_log_table (object_name, operation, ddl_statement)
  VALUES (CONVERT(NVARCHAR(255), @eventdata.value('(/EVENT_INSTANCE/ObjectName)[1]', 'NVARCHAR(255)')),
          CONVERT(NVARCHAR(100), @eventdata.value('(/EVENT_INSTANCE/EventType)[1]', 'NVARCHAR(100)')),
          CONVERT(NVARCHAR(MAX), @eventdata.value('(/EVENT_INSTANCE/TSQLCommand)[1]', 'NVARCHAR(MAX)')));
END;
```

SQL Server also requires the trigger to be enabled:

```sql
ENABLE TRIGGER schema_change_trigger ON DATABASE;
```
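
To confirm the database-level trigger exists and is active, you can check sys.triggers (a quick sanity check; is_disabled = 0 means enabled):

```sql
-- List database-level triggers and whether they are enabled
SELECT name, is_disabled, create_date
FROM sys.triggers
WHERE parent_class_desc = 'DATABASE';
```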

PostgreSQL

```sql
CREATE OR REPLACE FUNCTION schema_change_function_name()
RETURNS event_trigger
LANGUAGE plpgsql
AS $$
DECLARE
  cmd record;
BEGIN
  -- pg_event_trigger_ddl_commands() returns one row per DDL command in the statement
  FOR cmd IN SELECT * FROM pg_event_trigger_ddl_commands()
  LOOP
    -- Insert into a table that you create
    INSERT INTO schema_change_log_table (log_time, schema_name, object_identity, operation, ddl_statement)
    VALUES (current_timestamp, cmd.schema_name, cmd.object_identity, cmd.command_tag, current_query());
  END LOOP;
END;
$$;
```

Here’s how you would set up the corresponding event trigger:

```sql
CREATE EVENT TRIGGER schema_change_trigger
ON ddl_command_end
EXECUTE FUNCTION schema_change_function_name();
```
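
Once both the function and the event trigger are in place, you can sanity-check the setup by running a throwaway DDL statement and querying the log table (a sketch; the test table name is arbitrary):

```sql
-- Run a throwaway DDL statement, then confirm it was captured in the log table
CREATE TABLE schema_change_smoke_test (id integer);

SELECT log_time, schema_name, object_identity, operation
FROM schema_change_log_table
ORDER BY log_time DESC
LIMIT 5;

-- Clean up the test table
DROP TABLE schema_change_smoke_test;
```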

MySQL

MySQL doesn’t support DDL (event) triggers the way SQL Server and PostgreSQL do, since its triggers only fire on INSERT, UPDATE, and DELETE, so you can’t write DDL statements to a log table directly. Two common workarounds are parsing DDL entries out of the binary log, or periodically snapshotting information_schema and diffing the snapshots. Here’s a simple snapshot-based sketch (the table name is illustrative):

```sql
-- Create a table that stores periodic snapshots of column metadata
CREATE TABLE IF NOT EXISTS schema_column_snapshots (
    snapshot_time DATETIME    NOT NULL,
    table_schema  VARCHAR(64) NOT NULL,
    table_name    VARCHAR(64) NOT NULL,
    column_name   VARCHAR(64) NOT NULL,
    data_type     VARCHAR(64) NOT NULL
);

-- Run this on a schedule (e.g. via the MySQL event scheduler or cron)
INSERT INTO schema_column_snapshots (snapshot_time, table_schema, table_name, column_name, data_type)
SELECT NOW(), TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys');
```
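
Comparing the two most recent snapshots then surfaces newly added columns (flip the join to find dropped columns). A sketch, assuming the snapshot table above and MySQL 8.0+ for CTE support:

```sql
-- Columns present in the latest snapshot but not the previous one (i.e. newly added)
WITH latest AS (
  SELECT MAX(snapshot_time) AS ts FROM schema_column_snapshots
),
previous AS (
  SELECT MAX(snapshot_time) AS ts FROM schema_column_snapshots
  WHERE snapshot_time < (SELECT ts FROM latest)
)
SELECT cur.table_schema, cur.table_name, cur.column_name, cur.data_type
FROM schema_column_snapshots cur
LEFT JOIN schema_column_snapshots prev
       ON prev.snapshot_time = (SELECT ts FROM previous)
      AND prev.table_schema  = cur.table_schema
      AND prev.table_name    = cur.table_name
      AND prev.column_name   = cur.column_name
WHERE cur.snapshot_time = (SELECT ts FROM latest)
  AND prev.column_name IS NULL;
```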

Warehouse Schema Change Monitoring

Because data is centralized within the same lakehouse, it’s common to reuse the same raw tables, and even models created by others with overlapping responsibilities. As a result, it’s also common for a single schema change to create a ripple effect across multiple data products.

Sharing schema changes across an analytics team and a subset of their key stakeholders can also unify the broader organization’s understanding of where and when schema changes occur, and how these changes impact various data products.

Schema Change Tracking Examples

As of July 25th, 2023, Snowflake, BigQuery, and Redshift don’t offer a single metadata table or explicit option to write all schema changes to a physical table in your warehouse. As a result, you’ll want to track context about your schema (e.g. column counts in a given table) and cross-reference any schema changes with your query_history to determine what type of change occurred.

Snowflake

Here’s one example of a query that utilizes two important views within the information_schema, “TABLES” and “SCHEMATA”, to provide table counts for a given database (though it can do much more). You would then store the output in a table, potentially with a timestamp column indicating when you snapshotted this metadata.

```sql
WITH TABLE_COUNTS AS (
    SELECT TABLE_CATALOG, TABLE_SCHEMA, COUNT(*) AS TABLE_COUNT
    FROM <database_name>.INFORMATION_SCHEMA.TABLES
    GROUP BY TABLE_CATALOG, TABLE_SCHEMA
)
SELECT
    CATALOG_NAME AS DATABASE_NAME,
    SCHEMA_NAME AS NAME,
    TABLE_COUNT,
    CREATED AS SOURCE_CREATED_AT,
    COMMENT AS DESCRIPTION
FROM <database_name>.INFORMATION_SCHEMA.SCHEMATA AS s
LEFT JOIN TABLE_COUNTS t
    ON s.CATALOG_NAME = t.TABLE_CATALOG
    AND s.SCHEMA_NAME = t.TABLE_SCHEMA
```
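
To track changes over time, you could write this kind of output to a snapshot table on a schedule. A minimal sketch (the snapshot table name and columns are illustrative) that records per-schema table counts with a timestamp:

```sql
-- Illustrative snapshot table: capture per-schema table counts with a timestamp on each run
CREATE TABLE IF NOT EXISTS schema_metadata_snapshots (
    snapshot_at   TIMESTAMP_NTZ,
    database_name VARCHAR,
    schema_name   VARCHAR,
    table_count   NUMBER
);

INSERT INTO schema_metadata_snapshots (snapshot_at, database_name, schema_name, table_count)
SELECT CURRENT_TIMESTAMP(), TABLE_CATALOG, TABLE_SCHEMA, COUNT(*)
FROM <database_name>.INFORMATION_SCHEMA.TABLES
GROUP BY TABLE_CATALOG, TABLE_SCHEMA;
```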

One alternative solution for finding schema changes is to use Snowflake’s Time Travel feature, which allows you to see what a table looked like at a given point in time. The challenge with Time Travel as a solution, however, is that the number of lookback days is limited by your pricing plan.
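
For example, to see a table as it existed 24 hours ago (a sketch; the table name is a placeholder and the offset is in seconds):

```sql
-- Query the table as it existed 24 hours ago using Time Travel
SELECT *
FROM <database_name>.<schema_name>.<table_name> AT(OFFSET => -60*60*24)
LIMIT 10;
```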

BigQuery

You could create snapshots of schema contexts (e.g. table counts) and track any changes made in between snapshots, using views from information_schema. An example of this for the columns and data types within a table would be:

```sql
SELECT
  column_name,
  data_type
FROM
  `your_project_id.your_dataset_id.INFORMATION_SCHEMA.COLUMNS`
WHERE
  table_name = 'your_table_name';
```
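
To turn this into snapshots you can diff over time, you could append the results to a dated table on a schedule. A sketch, reusing the placeholder project and dataset names above (the snapshot table name is illustrative):

```sql
-- Illustrative snapshot table: append a timestamped copy of column metadata on each run
CREATE TABLE IF NOT EXISTS `your_project_id.your_dataset_id.schema_column_snapshots` (
  snapshot_at TIMESTAMP,
  table_name  STRING,
  column_name STRING,
  data_type   STRING
);

INSERT INTO `your_project_id.your_dataset_id.schema_column_snapshots` (snapshot_at, table_name, column_name, data_type)
SELECT CURRENT_TIMESTAMP(), table_name, column_name, data_type
FROM `your_project_id.your_dataset_id.INFORMATION_SCHEMA.COLUMNS`;
```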

An alternative approach would be to use BigQuery audit logs and look for ALTER TABLE statements.

Redshift

In Redshift, there are a few additional system tables that you can use (in lieu of information_schema), which you’ll find prefixed with PG_*, SYS_*, and SVV_*. Some tables of note will be: 

  • SYS_QUERY_HISTORY: This will give you your query history, with timestamps. If there are specific query patterns associated with your DDL statements (e.g. ALTER TABLE <table_name> RENAME <x>), you can parse the query_text with those patterns in mind (see the sketch after this list).
  • SVV_TABLE_INFO: This will give you row counts and table size for your given table(s). You can use these counts and sizes as early indicators of when data issues may exist.
  • PG_TABLE_DEF: You can use this table to understand what data types are associated with each column in any of your tables.
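
For example, a first pass at spotting DDL in recent query history could filter query_text for common DDL prefixes (a sketch; tune the patterns to your own DDL conventions):

```sql
-- Look for recent DDL statements in query history by pattern-matching query_text
SELECT start_time, query_text
FROM sys_query_history
WHERE query_text ILIKE 'alter table%'
   OR query_text ILIKE 'create table%'
   OR query_text ILIKE 'drop table%'
ORDER BY start_time DESC
LIMIT 50;
```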
