Data Incident Management
A data incident management workflow is a standardized process for handling data quality issues. Establishing this process builds trust, limits the business impact of bad data, and speeds up issue resolution and communication with stakeholders.
What constitutes an issue
Any change to your data profile can be classified as a data quality issue, but not every change is actually an incident. For example:
- Schema Changes: A column rename breaks downstream models that reference the field, but in many cases the renamed column isn’t used by any data product at all.
- Distribution Changes: For example, US subscription revenue suddenly drops from $2M to $0. The drop could signal a complete halt in user activity, or it could be the expected outcome of deprecating US support (a minimal detection sketch follows this list).
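To make the distribution example concrete, here is a minimal sketch of the kind of check a monitoring tool might run. The `is_distribution_anomaly` helper, the z-score threshold, and the revenue values are illustrative assumptions, not a prescribed implementation.

```python
from statistics import mean, stdev

def is_distribution_anomaly(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest value if it falls far outside the recent distribution.

    `history` holds prior daily values (e.g. US subscription revenue); a sudden
    drop from ~$2M to $0 yields a very large z-score and gets flagged.
    """
    if len(history) < 2:
        return False  # not enough data to estimate a distribution
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Example: ~30 days hovering around $2M, then today's value is $0.
history = [2_000_000 + i * 1_000 for i in range(30)]
print(is_distribution_anomaly(history, 0))  # True -> investigate before declaring an incident
```

Whether a flagged value is a true incident still depends on business context (for example, the planned deprecation of US support), which is exactly why the next step is identifying data products and stakeholders.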
The severity and priority of an incident depend heavily on how the affected data is used, which leads us to our next step: identifying data products and stakeholders.
Identifying Impact
In general, we can group data products into 4 primary categories:
- Analytics reports and dashboards (e.g. a monthly revenue tracking dashboard used by sales leaders)
- Automated data feeds to business applications (e.g. updating a customer satisfaction score in your CRM based on customer activities on your website)
- Automated data feeds to a customer facing product/feature (e.g. providing feature recommendations based on product usage)
- Data as the product (e.g. selling aggregated weather and economic forecast data created with your company’s proprietary models)
The first two categories typically involve internal stakeholders, while the last two involve external customers. Knowing who your stakeholders are allows you to proactively alert and update them as you identify and resolve incidents, which preserves trust in your team and your product.
To understand impact, you’ll need to:
- Identify issue origin: Start by identifying the fields and objects the incident directly references. For ad-hoc querying, get a sample query from the affected user(s). For a dashboard, you’ll need to understand how it was built and potentially parse the modeling definitions behind its fields.
- Identify downstream impact(s): Once you’ve identified the key table(s), trace how they’re used downstream through a combination of conversations and examination of other dashboards, reports, and highly leveraged modeled tables. This can be done with manual query parsing or with automated parsing from your in-house tooling or vendor tools; Metaplane, for example, automatically generates lineage to business intelligence reports and dashboards. The objective is to understand where erroneous data flows, provide context for resolution, and keep stakeholders informed during troubleshooting (a simple lineage-walking sketch follows this list).
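As a rough illustration of the downstream-impact step, here is a minimal sketch that walks a hand-maintained lineage graph. The table names and the `LINEAGE` mapping are hypothetical stand-ins for lineage you would assemble from conversations, query parsing, or a lineage tool.

```python
from collections import deque

# Hypothetical lineage: each table/dashboard maps to the assets that read from it.
LINEAGE = {
    "raw.stripe_charges": ["analytics.fct_revenue"],
    "analytics.fct_revenue": ["analytics.rev_by_region", "dashboards.monthly_revenue"],
    "analytics.rev_by_region": ["dashboards.sales_leadership"],
}

def downstream_assets(root: str) -> set[str]:
    """Breadth-first walk of the lineage graph from the table where the issue originated."""
    seen, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for child in LINEAGE.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

impacted = downstream_assets("raw.stripe_charges")
print(len(impacted), sorted(impacted))  # feeds the impact metrics described next
```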
Sample metrics to determine impact: number of impacted tables, number of impacted dashboards/reports, stakeholder prioritization, dashboard prioritization, and automation outside of the warehouse (e.g. reverse ETL syncs, embedded customer-facing analytics dashboards).
Set up alert destinations
Once impacted stakeholders are identified, establish an alerting destination that aligns with their notification workflows to keep them informed about incidents. For instance, this could be a dedicated Slack or Microsoft Teams channel like #marketing_analytics_issues, including both marketing executives and analysts. Additionally, consider setting up email listservs if email notifications are preferred in your company.
As the incident is being resolved, these channels will become a central point of communication. You may want to further determine alert routing and subsequent channel setup with these two methods:
- Object metadata: In a data mesh or similar setup, you can group alerts into relevant channels based on existing organizational structures and ownership. Whether ownership is distributed or a central team curates the data, manual tags on key tables can provide context and determine where alerts should land.
- Alert severity based on type of incident: For scoping purposes, it can help to separate schema changes from other incidents, especially in noisy environments like Salesforce where schema changes are frequent. We’ve seen teams whose analytics rely heavily on Salesforce create dedicated channels, for example one shared by Revenue Operations and Sales Analytics, to connect stakeholders with upstream system administrators (a routing sketch follows this list).
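Putting the two routing methods together, the sketch below shows one way to pick a channel from a table’s tag and incident type, then post a summary via an incoming webhook. The channel names, tags, and webhook URLs are placeholders, assumed for illustration, that you would replace with your own workspace’s values.

```python
import json
import urllib.request

# Hypothetical routing rules: tags come from object metadata, types from the monitor.
CHANNEL_BY_TAG = {"marketing": "#marketing_analytics_issues", "sales": "#revops_salesforce_schema"}
WEBHOOKS = {
    "#marketing_analytics_issues": "https://hooks.slack.com/services/T000/B000/XXX",  # placeholder
    "#revops_salesforce_schema": "https://hooks.slack.com/services/T000/B001/YYY",    # placeholder
}

def route_alert(table: str, tag: str, incident_type: str, detail: str) -> None:
    """Pick a channel from the table's tag and send a short incident summary."""
    channel = CHANNEL_BY_TAG.get(tag, "#data_quality_issues")  # default catch-all channel
    url = WEBHOOKS.get(channel)
    if url is None:
        return  # no webhook configured; fall back to email or another destination
    text = f"[{incident_type}] {table}: {detail}"
    req = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# route_alert("salesforce.opportunity", "sales", "schema_change", "column StageName renamed")
```

The same routing table can drive email listservs or other destinations; the important part is that the mapping from owner/tag to destination lives in one place.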
Triaging Incidents
To solve an incident, follow these steps:
- Trace upstream queries to examine how a table was created or updated, helping identify potential issues with modeling tools (e.g., dbt or Coalesce) or orchestration tools (e.g., Airflow, Dagster, Prefect).
- If the incident is caused by data in a raw table, investigate how the data was loaded.
- For a homegrown pipeline, involve the stakeholders who built it and look for factors like API version changes, outages, or infrastructure issues.
- Consider manually rerunning workflow(s) to backfill data if it’s needed urgently, or if the failure was caused by a temporary 3rd party outage that has since resolved (a hedged rerun sketch follows the note below).
Note: The steps may vary depending on the specific incident and data stack setup.
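If you do decide to rerun a workflow to backfill, the exact commands depend on your orchestrator and modeling tool. The snippet below is a hedged sketch using subprocess to invoke an Airflow backfill and a dbt rebuild; the DAG ID, model name, and date range are assumptions you would swap for your own.

```python
import subprocess

# Hypothetical identifiers: replace with your own DAG, model, and date range.
DAG_ID = "revenue_pipeline"
MODEL = "fct_revenue"
START, END = "2024-05-01", "2024-05-02"

def backfill_after_outage() -> None:
    """Rerun the loading DAG for the gap, then rebuild the affected model and its children."""
    subprocess.run(
        ["airflow", "dags", "backfill", "--start-date", START, "--end-date", END, DAG_ID],
        check=True,
    )
    subprocess.run(["dbt", "run", "--select", f"{MODEL}+"], check=True)

# backfill_after_outage()  # run only after confirming the upstream outage has resolved
```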
Create an on-call rotation
Establish an on-call rotation, including a variety of stakeholders, to address future incidents. Including both business and technical team members will expedite resolution, provide additional context, and ensure a faster validation of fixes.
- Business stakeholder(s): For internal analytics, involve at least one person from the affected department. In customer-facing applications, ensure someone from the customer success or product team is informed to assist with customer communications during the fix deployment.
- Technical team member(s): You’ll want to include not only data engineers, analysts, and scientists, but also consider software engineers who may have worked on dependencies (e.g. sending events to Kafka).
Once team members are identified, establish their involvement timeline based on incident severity levels or another prioritization format, using the sample impact metrics mentioned earlier. Advanced data teams may also implement Service Level Agreements (SLAs) to formalize the expected resolution time for data quality incidents, as sketched below.
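One lightweight way to formalize severity levels and SLAs is a simple lookup that the on-call rotation can share. The tiers, examples, and hour counts below are illustrative defaults, not recommendations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    name: str
    example: str
    ack_within_hours: int      # time for the on-call person to acknowledge
    resolve_within_hours: int  # target resolution time (the SLA)

# Illustrative tiers; tune them to your own data products and stakeholders.
SEVERITIES = {
    "sev1": Severity("sev1", "customer-facing data product is wrong", 1, 4),
    "sev2": Severity("sev2", "executive dashboard shows bad numbers", 4, 24),
    "sev3": Severity("sev3", "unused column renamed upstream", 24, 72),
}

print(SEVERITIES["sev1"].resolve_within_hours)  # 4-hour SLA for the most severe tier
```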
Data incident retrospectives
Finally, to prevent future incidents, your team should have a formalized data incident retrospective process. Points of discussion should include:
- Description of incident
- Impact of incident (both to the business and stakeholders)
- What led to the incident
- How an incident was fixed
- Safeguards created as a result
You’ll also want to establish a cadence and reporting process for these data incident retrospectives. A monthly cadence is a good place to start as you begin to unearth incidents, though you can make this quarterly as the number of discovered incidents begins to tail off.