Where there is data, there is the risk of bad data. And bad data is expensive. As a result, data people today often have more tools in their stack than friends in their bubble to ingest, store, transform, and visualize data.
This overview focuses specifically on data observability tools by describing where they started, how they co-evolved with the modern data stack, and where they might be going.
Who are data people? Their job title might be data engineer, analytics engineer, data analyst, data scientist, or not include the word "data" at all. When in doubt, just ask if SQL column names should have periods in them. If you get a strong response, you're probably talking to a data person.
The problems that data observability tools solve are simple to state: is the data the business depends on arriving on time, complete, and correct, and when it is not, where did it break?
But before getting into how today’s ecosystem addresses these problems, it’s worth exploring how we got here.
Market hype can make it seem like databases did not exist before Snowflake, Redshift, and BigQuery. But companies ran for decades (and continue to run) on mainframes from the 1960s, then on the main combatants of the 1990s database wars: Oracle Database, IBM DB2, and Microsoft SQL Server. In this ecosystem, data teams used all-in-one or vendor-specific database integration tools to manage data quality.
Until the arrival of cloud databases like RDS in 2009 and data warehouses like Redshift in 2012, “data” work within companies was typically done by IT teams using on-premise enterprise software. Transformation was either performed before arrival in the database, or by a downstream business intelligence (BI) application.
All warehouses are databases, but not all databases are warehouses. Unlike transactional databases optimized for reading and writing entries quickly and reliably, warehouses are ideal for running analytic queries like getting the average revenue of a customer segment. The big three warehouses are Snowflake, Amazon Redshift, and Google BigQuery. Data lakes warrant an article of their own, as does their convergence with warehouses.
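To make the distinction concrete, here is a minimal sketch of the two query shapes, using an in-memory SQLite database purely as a stand-in and made-up table and column names: the point lookup a transactional database is tuned for, and the scan-and-aggregate query a warehouse is built to answer.

```python
import sqlite3

# In-memory SQLite standing in for both systems; the table and columns are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_segment TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "enterprise", 5200.0), (2, "enterprise", 4800.0), (3, "self_serve", 90.0)],
)

# Transactional shape: read or write a single row, quickly and reliably.
print(conn.execute("SELECT * FROM orders WHERE order_id = 2").fetchone())

# Analytic shape: scan many rows and aggregate, e.g. average revenue per customer segment.
for segment, avg_revenue in conn.execute(
    "SELECT customer_segment, AVG(revenue) FROM orders GROUP BY customer_segment"
):
    print(segment, avg_revenue)
```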
Warehouses took their name from the warehouse architecture pioneered by Ralph Kimball and Bill Inmon in the 1990s, which centralized business data in a single source of truth instead of separate systems. 2010s technology fit the 1990s architecture perfectly.
Like the color black, spreadsheets will always be in, whether or not they’re the hot new item. In the early 2010s when the “big data” buzzword was taking off, many analysts who bugged their database admins for extracts were confronted with the realization that their data was not clean. Enter the universal API: tabular data.
There will always be a place for an interactive tool for cleaning the last mile of data. But as a new pattern called the modern data stack takes off, there are new upstream opportunities to make sure that the data is ingested properly, transformed into usable forms, and validated before the last mile.
The last mile of data refers to how the data in a warehouse is used. Internal use cases include powering reporting dashboards and operational tools like Salesforce. External use cases include triggering in-product experiences, training machine learning models, and sharing data with partners or customers.
By the mid-2010s, the cloud data warehouse war between Google BigQuery, Amazon Redshift, and Snowflake was in full force. As data warehouses emerged at the top of the food chain, an ecosystem of tools co-evolved alongside them, including easy ways to extract and load data from sources into a warehouse, transform that data, then make it available for consumption by end users.
The modern data stack is a cutting-edge setup for centralizing and using data within a company. Tools move data from sources like Salesforce into data warehouses like Snowflake that make it very cheap to store gobs of data and easy to analyze it quickly. As a result, data teams have the leverage to do more with less.
The TLDR? Because data storage is dirt cheap, let’s put it all into one place and transform it later.
With the adoption of the modern data stack, more and more data is centralized in one place, then used for critical applications: powering operational analytics, training machine learning models, and driving in-product experiences. And, importantly, we can keep changing the data even after it’s in the warehouse, lending flexibility to data that was once rigid.
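As a sketch of what “transform it later” looks like in practice, the snippet below assumes raw source data has already landed untouched in the warehouse and then reshapes it with SQL. The schema, table, and column names and the `run` helper are placeholders for illustration; in a real stack the load step is handled by an ingestion tool and the transform step by a tool such as dbt.

```python
# Illustrative only: generic SQL with made-up names; `run` stands in for executing
# a statement against a warehouse connection.
def run(sql: str) -> None:
    print(sql)  # placeholder: a real version would execute against the warehouse

# Step 1 (extract + load): an ingestion tool lands source data as-is in a raw schema,
# e.g. a table named raw.salesforce_opportunities that mirrors the source exactly.

# Step 2 (transform): because storage is cheap and the raw data already lives in the
# warehouse, cleanup is just another SQL statement that can be rerun at any time.
run("""
    CREATE OR REPLACE TABLE analytics.won_opportunities AS
    SELECT id, account_id, amount, close_date
    FROM raw.salesforce_opportunities
    WHERE stage = 'Closed Won'
""")
```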
Increasing amounts of data. Increasing importance of data. Increasing fragmentation of vendors. These trends, coupled with investor appetite to fund the next Snowflake, make it no surprise that we’re witnessing what the CEO of Fishtown Analytics calls a Cambrian explosion of new tools supporting data quality.
Some of these tools are open source.
Many more tools are young commercial offerings. Recently, many tools (including ours) have described themselves as offering “data observability,” possibly inspired by the success of Datadog, a public company that provides observability for software engineers.
Observability is a concept from control theory that describes how well the state of a system can be inferred from outputs. Applied to software systems like an EC2 instance hosting a web server, poor observability might be a health check ping, while strong observability could be a Datadog agent sending system metrics like CPU utilization, network performance, and system logs.
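As a rough illustration of that spectrum, here is a minimal sketch: a bare health check tells you only that something answered, while emitting richer signals lets you infer what state the system is actually in. The URL, metric names, and `emit` destination below are placeholders, not a real agent API.

```python
import time
import urllib.request

# Weak observability: one binary signal. If it fails, you know almost nothing about why.
def health_check(url: str) -> bool:
    try:
        return urllib.request.urlopen(url, timeout=2).status == 200
    except OSError:
        return False

# Stronger observability: emit enough outputs that internal state can be inferred from
# them (here just printed; a real agent would ship these to a monitoring backend).
def emit(metric: str, value: float) -> None:
    print(f"{int(time.time())} {metric}={value}")

def report_metrics(cpu_utilization: float, request_latency_ms: float, error_rate: float) -> None:
    emit("system.cpu.utilization", cpu_utilization)
    emit("http.request.latency_ms", request_latency_ms)
    emit("http.request.error_rate", error_rate)
```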
One way we like to think of observability is to ask: How many questions do you need to ask before you're confident in the state of the system?
Applied to a modern data stack, “data observability” tools aim to provide the data needed to answer questions of this sort.
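To make that concrete, here is a minimal sketch of the kinds of checks these tools automate, written as hand-rolled queries. The table, columns, thresholds, and the `run_query` helper are all assumptions for illustration, not any particular vendor’s API.

```python
from datetime import datetime, timedelta, timezone

def run_query(sql: str):
    """Placeholder for a warehouse client call that returns a single value."""
    raise NotImplementedError

def is_fresh(table: str, max_lag: timedelta) -> bool:
    """Freshness: did new rows arrive recently enough?"""
    latest_load = run_query(f"SELECT MAX(loaded_at) FROM {table}")
    return datetime.now(timezone.utc) - latest_load <= max_lag

def has_expected_volume(table: str, min_rows_today: int) -> bool:
    """Volume: did today's load produce roughly the number of rows we expect?"""
    row_count = run_query(f"SELECT COUNT(*) FROM {table} WHERE loaded_at >= CURRENT_DATE")
    return row_count >= min_rows_today

def within_null_budget(table: str, column: str, max_null_fraction: float) -> bool:
    """Quality: did a key column suddenly fill up with NULLs?"""
    null_fraction = run_query(
        f"SELECT AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) FROM {table}"
    )
    return null_fraction <= max_null_fraction
```

An observability tool runs checks like these continuously across every table, rather than waiting for a broken dashboard to surface the problem.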
The new vendors in this space may well seem to outnumber the galaxies in the universe.
How do you choose between these vendors as a consumer? We found that it’s useful to segment observability tools along four dimensions.
The most important question when evaluating a vendor, of course, is:
Does this tool solve a real problem for me? How frequently does a downstream business user ping you on Slack because a dashboard is broken? How long does it take to peel away layers of logs to identify the root cause?
Why can’t I just use Datadog? Datadog is the default tool for monitoring infrastructure and applications, and if you’re mostly concerned with the health of your data pipelines themselves, Datadog can probably cover that. But Datadog is not designed for the data flowing through those pipelines, nor for the data sitting at rest in the warehouse. You could probably make it work, but you’d be shoehorning abstractions.
Is the modern data stack the final configuration of systems that maximizes utility for data producers and consumers alike? Or is it one apex of the pendulum swinging between bundling and unbundling? One thing is certain: the grooves cut by the flow of data within organizations are here to stay, as are the people responsible for ensuring that flow. And we find it hard to imagine a future in which data teams continue to fight an uphill battle to build trust in data.
Now that data warehousing and extraction/loading are becoming solved problems, we’re eager to see the new technologies that will emerge as warehouses pull more and more data into their gravitational field. Just as the first filmed movies were simply plays in front of a camera, “data observability” can feel like reproducing old concepts in a new domain. But once technologists gain their bearings in this new space, we’re excited to see a future in which data can be delivered faster, more reliably, and in a more usable way to where it needs to be.
We believe data observability tools will play a critical role because, unlike the physical universe, our data warehouses do not have to tend toward disorder.