How to build a live time series anomaly detection model
Need to know when something’s off? Live time series models can detect anomalies for any metric in your data stack. Here’s how to build one yourself.
How do you detect when a dbt job is lagging?
How do you get early warnings of problems in ETL pipelines?
How do you answer questions like these?
Sure, you can track dozens of metrics to try and make sense of what’s happening in your data stack, but once you have them… what do you actually do with them?
Blindly tracking metrics is easy. Thoughtfully using them is surprisingly hard: you need to constantly review them and decide whether or not they’re healthy using contextual knowledge about your stack. Doing this by hand for hundreds of metrics is time-consuming—if not impossible—and a waste of effort if those metrics aren’t doing anything interesting.
That’s why we need tools to flag us when our once-consistent metrics start looking inconsistent.
Be the first to know, every time
The data team needs to know as soon as any metric looks irregular. Like, if a dbt job that usually takes 20 minutes now takes over an hour, or a regular Fivetran table sync that usually adds 1000 rows now only adds 50.
It might be just a weird fluke, but it might be a signal that something’s broken. If you know immediately when something looks not-quite-right, you can:
- Figure out if there’s a problem at all
- If so, fix the problem before it has major downstream effects
Live time series anomaly detection allows you to do this for any metric in your data stack. But what is it exactly? Let’s break it down:
- Live (aka real-time or online): Using a statistical model on the most recent data (e.g., evaluating the latest dbt job run time).
- Time series: A dataset that is indexed by time or a statistical model for such data (e.g., a dbt job’s run times over the past month).
- Anomaly detection: Flagging some data in a dataset as being unlike the rest (e.g., flagging an unusually long run time for one dbt job run).
Why time series?
Anomaly detection methods that use time-sensitive information perform much better than models that don’t—especially when you’re observing metrics that change over time (like data warehouse metrics that exhibit trends and cycles).
Trends change over time
Let’s say we track run times for a dbt job, and we ignore the time index. They have a distribution like this, with run times ranging from 10 minutes to 28 minutes, but most taking around 20:
If I told you that the last dbt job run took 25 minutes, would you consider that anomalous? Probably not—it falls neatly into the above distribution. But let’s look at the same data using the time index:
While a 25-minute dbt run might not have been anomalous in the past, you can see at the right edge of the chart that the run only takes about 15 minutes now. The run time of this job has been decreasing as the computing power assigned to the job has increased. So, if a new run did take 25 minutes, you might think, “Hm, that doesn’t look quite right,” and take the time to investigate.
Cycles (aka seasonality) need context
Imagine you have a Snowflake table that gets lots of inserts, but there are fewer on the weekend.
These weekend drop-offs aren't unexpected—they happen consistently each week. So, they're not really "anomalies" in the sense of something going wrong; they're just part of the regular cycle of how your system operates.
Now, if today were a Wednesday, and I told you that 7,000 rows were added in the last hour—no sweat. But if today were a Saturday? You’d probably want to look into it.
Is it an anomaly… or not?
Ultimately, we need a way to decide what counts as an anomaly and is worthy of our attention. Trends and cycles often carry a certain level of noise that can easily distract us from the real signals. Separating the normal noise from the “something’s wrong!” signal is key to identifying true anomalies.
Most methods (including Metaplane’s) use thresholds to decide what counts as normal vs anomalous. Depending on the metric, you might use static, rule-based, or model-based thresholds.
Static thresholds for anomaly detection
The simplest anomaly detection strategy is to set a threshold at a single value, especially if you have a clear idea of what you consider an “unhealthy” value and expect that definition to apply indefinitely.
This kind of static approach works well in scenarios where your system is stable and predictable, and you have a strong sense of what constitutes an outlier. It’s straightforward, so you can focus only on the most critical deviations without getting bogged down by minor variations.
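For instance, a static threshold check can be a single comparison. Here’s a minimal sketch (the 60-minute cutoff and the function name are illustrative, not prescribed):

# Hypothetical static threshold: flag any dbt job run longer than 60 minutes
STATIC_THRESHOLD_MINUTES = 60

def is_run_time_anomalous(run_time_minutes):
    # A run is flagged whenever it exceeds the fixed cutoff
    return run_time_minutes > STATIC_THRESHOLD_MINUTES

print(is_run_time_anomalous(25))  # False
print(is_run_time_anomalous(75))  # True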
But if your metrics vary over time (or when you're dealing with a dynamic system where "normal" values aren't so easy to pin down), static thresholds won’t work for your use case. You’ll either miss important anomalies or get way too many false positive flags.
Rule-based thresholds for anomaly detection
Sometimes, you need an approach that recognizes that what constitutes "normal" isn't static; it can vary based on context. That’s where rule-based thresholds come in.
A rule-based threshold allows you to set more nuanced criteria—or rules—that reflect the inherent rhythms of your data. They’re generally better aligned with your data’s real-world behavior, so they typically reduce noise and focus your attention on the deviations that genuinely matter.
Good news: the complexity of your data doesn’t necessarily require equally complex solutions. Most cases can be covered by simple rules.
Rule-based thresholds rest on clear, logical assumptions about how your data behaves. So, while this approach works well when you can express the rule with simple logic, that logic can become unwieldy or outright impossible to write for more complex cases.
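As a rough sketch of what such a rule could look like in code, echoing the weekday-vs-weekend insert example above (the bounds here are made up for illustration):

# Hypothetical rule: weekday and weekend inserts have different "normal" ranges
WEEKDAY_RANGE = (4000, 10000)  # illustrative rows-per-hour bounds
WEEKEND_RANGE = (500, 3000)

def is_insert_count_anomalous(rows_inserted, day_of_week):
    # day_of_week: 0 = Monday ... 6 = Sunday
    low, high = WEEKEND_RANGE if day_of_week >= 5 else WEEKDAY_RANGE
    return not (low <= rows_inserted <= high)

print(is_insert_count_anomalous(7000, day_of_week=2))  # Wednesday: False
print(is_insert_count_anomalous(7000, day_of_week=5))  # Saturday: True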
Model-based thresholds for anomaly detection
In particularly complex cases, statistical time series models are the most powerful option for setting thresholds because they can handle trends, cycles, and other common patterns. They don't just set a fixed threshold or apply a simple rule; they learn from the data itself.
And while there are many sophisticated models to choose from, even simple models can provide good performance. A linear regression can handle many cases.
By learning from historical data, these models can adapt to changing conditions. This is especially valuable in environments where "normal" is a moving target, and where traditional methods might either overfit (react too strongly to noise) or underfit (miss subtle but significant deviations).
DIY: Live time series model tutorial
Time series models are designed to handle various patterns in the data, such as trends over time, seasonal cycles, or even irregular events that occur with some predictability—that’s what makes them great for setting thresholds when simple rules won’t work. But they do require statistical knowledge to choose the right model, train it, and tune it correctly.
So let’s walk through the setup together. In this example, we’ll implement a linear regression model on some sample data. Here’s the step-by-step process for using a time series model to do anomaly detection on metrics as they come in:
- Download the data file from this link, then import the libraries you’ll need:
import numpy as np
import pandas as pd
import statsmodels.api as sm
- Plot your metric data over time and choose an appropriate time series model to use
# Load the sample data, parsing the timestamp column as dates
data = pd.read_csv('dbt_run_times.csv', parse_dates=["timestamp"], date_format="%m/%d/%y %H:%M:%S")
datetimes = pd.to_datetime(data["timestamp"])
# Convert timestamps to seconds elapsed since the first run (the model's input feature)
data["total_seconds"] = (datetimes - datetimes.iloc[0]).dt.total_seconds()
# Plot run times over time to inspect trends and cycles
data.plot(x="timestamp", y="runtime", xticks=[0, 192, 384], xlabel="Time of Run", ylabel="dbt Job Runtime (in minutes)")
💡 Pro tip: Consider adding other features in your model alongside the raw metric data, such as day of week or other metrics you might track.
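For example, a day-of-week feature could be derived from the timestamps already parsed above (the column name is arbitrary):

# Optional extra feature: day of week (0 = Monday ... 6 = Sunday)
data["day_of_week"] = datetimes.dt.dayofweek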
- Create a training dataset from the historical data, excluding any past anomalies.
# Exclude rows flagged as anomalies so the model learns only "normal" behavior
training_data = data[~data["anomaly"]].copy()
training_data.plot(x="timestamp", y="runtime", xticks=[0, 190, 380], xlabel="Time of Run", ylabel="dbt Job Runtime (in minutes)")
💡 Pro tip: If you don’t know what past data were anomalies, you can start from the beginning of your metric history and loop forward through it with your model, detecting if each new point is an anomaly.
- Set hyperparameters to generate thresholds.
# Significance level for the prediction interval (0.01 gives a 99% interval)
alpha = 0.01
💡 Pro tip: With linear regression, you will usually want to set an alpha level on the prediction interval.
- Train your model on the historical data, tuning the hyperparameters to maximize performance on anomaly detection.
# Fit an ordinary least squares regression of runtime on elapsed seconds
X_train = training_data["total_seconds"].values
y_train = training_data["runtime"].values
X_train_sm = sm.add_constant(X_train)  # add an intercept term
model = sm.OLS(endog=y_train, exog=X_train_sm)
result = model.fit()
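Optionally, you can sanity-check the fit before moving on; this inspection step isn’t part of the original walkthrough:

# Inspect the fitted intercept and per-second slope of the trend line
print(result.params)
# Or view the full regression summary
print(result.summary())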
- Generate prediction intervals for each point in the time series data.
def generate_prediction_intervals(ts, alpha):
    """
    Generate prediction intervals for each point in the time series data.

    Parameters:
        ts (pd.DataFrame): The time series data.
        alpha (float): Significance level for prediction interval.

    Returns:
        pd.DataFrame: DataFrame with predictions and intervals.
    """
    intervals = []
    # Iterate over the time series data
    for i in range(10, len(ts)):
        # Training set up to the current point
        train = ts.iloc[:i]
        X_train = train["total_seconds"].values
        y_train = train["runtime"].values
        X_train_sm = sm.add_constant(X_train)
        # Fit model on training data
        model = sm.OLS(endog=y_train, exog=X_train_sm)
        result = model.fit()
        # Get latest point to generate prediction
        latest = ts.iloc[i]
        X_latest = latest["total_seconds"]
        X_latest_sm = [1, X_latest]
        # Generate prediction and interval stats
        pred = result.get_prediction(X_latest_sm)
        pred_summary = pred.summary_frame(alpha=alpha)
        # Extract interval bounds
        forecast, pred_int_lower, pred_int_upper = pred_summary[['mean', 'obs_ci_lower', 'obs_ci_upper']].iloc[0].values
        # Append results
        intervals.append({
            'Date': ts.index[i],
            'Actual': ts.iloc[i, 1],
            'Forecast': forecast,
            'Lower Threshold': pred_int_lower,
            'Upper Threshold': pred_int_upper
        })
    # Create DataFrame from intervals
    intervals_df = pd.DataFrame(intervals)
    return intervals_df
- Validate your model by backtesting it as follows. See what gets flagged as an anomaly!
# Backtest: generate intervals across the history and plot them against the actual values
backtested_data = generate_prediction_intervals(data, alpha)
backtested_data.plot(x='Date', y=['Actual', 'Lower Threshold', 'Upper Threshold'])
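To see exactly which historical points get flagged, you can mark every backtested value that falls outside its interval (a small addition on top of the steps above):

# Flag any historical point that falls outside its prediction interval
backtested_data["is_anomaly"] = (
    (backtested_data["Actual"] < backtested_data["Lower Threshold"]) |
    (backtested_data["Actual"] > backtested_data["Upper Threshold"])
)
print(backtested_data[backtested_data["is_anomaly"]])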
💡 Pro tip: If there are no anomalies in the history, your tuned model should not flag any anomalies. If there are anomalies in the history, your tuned model should only flag those points.
- When a new metric value comes in, generate a forecast from the model for that timestamp, along with the associated thresholds.
# New runtime observed one hour (3,600 seconds) after the last run in the dataset
new_value = 16.2
new_total_seconds = data["total_seconds"].iloc[-1] + 3600
X_pred = [1, new_total_seconds]
# Thresholds for the new timestamp, using the same alpha as before
result.get_prediction(X_pred).summary_frame(alpha=alpha)[["obs_ci_lower", "obs_ci_upper"]]
If the new value is outside the thresholds, congratulations, you’ve detected your first anomaly!
If the new value is within the thresholds, add it to your training dataset and retrain the model. Depending on the velocity of the metric data and the retraining cost, you might not want to do this after every new value, but you should do it frequently enough that your model reflects the current data patterns.
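Here’s a minimal sketch of how that check and retraining might look, reusing the variables defined above:

# Compare the new value against the interval generated for its timestamp
pred = result.get_prediction(X_pred).summary_frame(alpha=alpha)
lower, upper = pred["obs_ci_lower"].iloc[0], pred["obs_ci_upper"].iloc[0]

if new_value < lower or new_value > upper:
    print("Anomaly detected!")
else:
    # Not anomalous: fold the new point into the training data and refit
    new_row = {"runtime": new_value, "total_seconds": new_total_seconds}
    training_data = pd.concat([training_data, pd.DataFrame([new_row])], ignore_index=True)
    X = sm.add_constant(training_data["total_seconds"].values)
    result = sm.OLS(endog=training_data["runtime"].values, exog=X).fit()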
An easier way to scale anomaly detection
Luckily, many time series models are not computationally intensive to train or generate predictions from. And once you’ve built this process for one metric, you can often reuse the same model structure for similar metrics, training one model per metric.
For example, if your linear regression model works well for the primary dbt job, you can use the same model structure for a nightly ETL job that also follows regular cycles. Of course, you'd need to retrain the model on the specific historical data of the ETL job so it can learn the patterns and behaviors of that specific job, but it cuts down the effort big time.
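In code, reusing the structure just means refitting the same regression on the other job’s history. A sketch, assuming a hypothetical etl_run_times.csv with the same timestamp and runtime columns:

# Same model structure, different metric: refit on the ETL job's own history
etl_data = pd.read_csv('etl_run_times.csv', parse_dates=["timestamp"])
etl_datetimes = pd.to_datetime(etl_data["timestamp"])
etl_data["total_seconds"] = (etl_datetimes - etl_datetimes.iloc[0]).dt.total_seconds()
X_etl = sm.add_constant(etl_data["total_seconds"].values)
etl_result = sm.OLS(endog=etl_data["runtime"].values, exog=X_etl).fit()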
As you branch out to more metric types, continue to revisit the model hyperparameters, or consider new time series models, to best capture the patterns in your metrics. Or, if you’d rather not go through this entire process yourself, we’ve got you covered. Metaplane’s ML-based, always-on monitoring alerts you to every issue across your data stack.
Get started with Metaplane for free or book a demo to learn more.