Forecasting How 2023’s Data Engineering Trends Will Fare in 2024
Look - it’s November. It’s not the end of 2023 - and by many companies’ fiscal calendars, it’s actually only the beginning of Q4, meaning there’s still a full 25% of the year left to generate revenue. Having said that, we’re beginning to wind down from conference season and head into the holidays, which means that we’ll see a reduction in large product announcements. That reduction of announcements means the blessed abatement of new terms to learn, which gives us just enough time to reflect on 2023’s trending terms.
The trends below by no means started in 2023, but they seem to have picked up steam, perhaps driven by the increased use of data and its growing importance to the business, which in turn brings increased vigilance around how the data department grows and what it owns.
Trend 1: Data Contracts
2023 Associated Vendor Products/Features: dbt Contracts, Hightouch Contracts
This might’ve been THE hottest trend of 2023. So hot, in fact, that Andrew Jones wrote an entire book about it. Data contracts represent the evolution from written business requirements to technical constraints put in place to enforce components of data, such as structure, bounds of values, delivery frequency, and more. The image above covers one area where data contracts could “sit” in the process, but the ideal location will depend on your infrastructure and your team’s technical capabilities.
Rating: Grow - In the past few years we’ve seen an explosion in data observability (e.g. Metaplane!) and data catalog vendors aimed at addressing quality concerns for the data already in your warehouse/lake. It stands to reason that we’ll continue to see teams turn an eye toward prevention, rather than just resolution, of data quality incidents.
Bonus hot take: Two years from now, every data “enablement”/“office hours”/“working session” will start by referencing the data contract alongside any submitted tickets. Data contracts will be generated by business stakeholders directly through internal or vendor tooling, though they’ll need to be qualified by the data team before implementation.
Consider when: Business stakeholders frequently report data quality issues for requirements that weren’t originally scoped, or there’s frequent turnover or collaboration on models.
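To make the idea concrete, here’s a minimal, hypothetical sketch of contract enforcement in Python: the contract names the expected columns, their types, value bounds, and a delivery-frequency expectation, and a check runs before data is published downstream. The dataset, column names, and thresholds are illustrative and not tied to any specific vendor’s implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical contract for an "orders" dataset: structure, value bounds,
# and delivery frequency, as described above.
@dataclass
class DataContract:
    required_columns: dict                              # column name -> expected Python type
    value_bounds: dict = field(default_factory=dict)    # column -> (min, max)
    max_delivery_lag: timedelta = timedelta(hours=24)   # agreed delivery frequency

def validate_batch(records: list[dict], delivered_at: datetime, contract: DataContract) -> list[str]:
    """Return a list of contract violations for a batch of records."""
    violations = []

    # Delivery frequency: the batch must arrive within the agreed lag.
    if datetime.utcnow() - delivered_at > contract.max_delivery_lag:
        violations.append("Batch delivered later than the agreed frequency")

    for i, row in enumerate(records):
        # Structure: every required column is present with the right type.
        for col, expected_type in contract.required_columns.items():
            if col not in row:
                violations.append(f"Row {i}: missing column '{col}'")
            elif not isinstance(row[col], expected_type):
                violations.append(f"Row {i}: '{col}' is not {expected_type.__name__}")

        # Bounds: numeric values stay within the agreed range.
        for col, (lo, hi) in contract.value_bounds.items():
            if col in row and not (lo <= row[col] <= hi):
                violations.append(f"Row {i}: '{col}'={row[col]} outside [{lo}, {hi}]")

    return violations

orders_contract = DataContract(
    required_columns={"order_id": str, "amount": float},
    value_bounds={"amount": (0.0, 100_000.0)},
    max_delivery_lag=timedelta(hours=6),
)

batch = [{"order_id": "A-1", "amount": 42.0}, {"order_id": "A-2", "amount": -5.0}]
print(validate_batch(batch, delivered_at=datetime.utcnow(), contract=orders_contract))
# -> ["Row 1: 'amount'=-5.0 outside [0.0, 100000.0]"]
```

In practice a check like this could sit in CI (as with dbt’s model contracts), at the ingestion boundary, or right before syncs to downstream tools, depending on where you decide the contract should “sit”.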
Trend 2: Data Mesh
2023 Associated Vendor Products/Features: dbt Mesh
The concepts behind data mesh aren’t new - but in 2022 we started to see them grouped under this term. In a world where we have more access to data than ever before, much of it sitting in cold storage in the warehouse, a data mesh implementation is a process-driven solution, backed by technological guardrails such as access permissions.
Rating: Grow
Hot take: Teams have already adopted aspects of this joint ownership model in the past, so it stands to reason that we'll see more teams consider it as they grow. The hot take is that in 2 years, we stop using this term as it becomes the de facto way to manage your data (and structure your team responsibilities).
Consider if/when: You’re a larger org, the hub-and-spoke model isn’t working for you, and/or you begin to use the term “data product” to describe 20+ outputs
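For a sense of what those “technological guardrails” could look like, here’s a small, hypothetical sketch: each domain registers the data products it owns, and access to another domain’s product requires an explicit grant rather than broad, warehouse-wide permissions. The domain and product names are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    owning_domain: str                                  # the team accountable for this product
    granted_domains: set = field(default_factory=set)   # domains allowed to read it

# Hypothetical registry: each domain owns and serves its own products.
registry = {
    "orders_daily": DataProduct("orders_daily", owning_domain="commerce",
                                granted_domains={"finance"}),
    "web_sessions": DataProduct("web_sessions", owning_domain="marketing"),
}

def can_read(requesting_domain: str, product_name: str) -> bool:
    """Owners always read their own products; other domains need an explicit grant."""
    product = registry[product_name]
    return requesting_domain == product.owning_domain or requesting_domain in product.granted_domains

print(can_read("finance", "orders_daily"))   # True: explicit grant from the commerce domain
print(can_read("finance", "web_sessions"))   # False: no grant from marketing
```

The point isn’t the code itself but the operating model it encodes: ownership and access are declared per data product, by the domain that produces it.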
Trend 3: Data Fabric
2023 Associated Vendor Products/Features: Microsoft Fabric, Talend Data Fabric
Data Fabric setups can be thought of as a combination of multiple concepts: the ability to access data (i.e. via virtualization or access to a landing zone fed by an ELT pipeline), governance controls (e.g. industry regulations), and the ability to surface some context to users about what data is available to access. The term began trending again in 2023, likely due to the growth of Trino and Microsoft’s adoption of the word “Fabric” to describe their cloud tool interoperability. Note that a few concepts overlap with data mesh, but the two should be seen as complementary rather than alternative approaches.
Rating: Growth of term usage, but no growth in adoption
Hot take: Fabric becomes the new “modern data stack”. Teams want to make sure their centralized data storages (e.g. RDBMS, lakes) are accessible by the tools and teams that need them, without having to physically move the data first.
Consider if/when: There are technical constraints to moving your data or you’re an enterprise sized company
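To illustrate the “access without moving the data” piece, here’s a hedged sketch using Trino’s Python client to run one federated query across two catalogs (say, a Postgres operational database and a lake catalog). The host, user, catalog, schema, and table names are placeholders you’d swap for your own.

```python
# Requires the `trino` package: pip install trino
from trino.dbapi import connect

# Placeholder connection details: point this at your own Trino coordinator.
conn = connect(host="trino.internal.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One query joins data across two catalogs without copying either side:
# `postgresql` might front an operational RDBMS and `lake` a Hive/Iceberg catalog.
cur.execute("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM postgresql.public.orders AS o
    JOIN lake.analytics.customers AS c
      ON o.customer_id = c.customer_id
    GROUP BY c.region
""")

for region, revenue in cur.fetchall():
    print(region, revenue)
```

This is the virtualization half of a fabric; the governance and discoverability pieces still come from the access controls and catalog tooling layered around it.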
Bonus Trend: Semantic Layer + Metric Layers
Aka one comparison point of business intelligence tools in 2024
Semantic Layers, Metric Layers, and a renewed focus on modeling are officially BACK in 2024, with the new generation of business intelligence tools (e.g. Zenlytic, Count) and those a half generation behind (e.g. Sigma, ThoughtSpot) battling for the throne previously shared by Tableau and Looker (pre-acquisition). One differentiating aspect of these tools is how they approach “field specifications”.
The way I’m defining that term in this context is the answer to “How does <BI Tool> define the fields or columns that users are aggregating and/or running other calculations on?” Here are a few of the different ways tools approach this problem (a tool-agnostic sketch of what a field specification could look like follows the list):
- Zenlytic defines it “in platform”, with a similar structure to how you’d specify models in dbt. They were one of the bigger proponents of the term “semantic layer” in 2023 and will likely continue to be in 2024.
- Lightdash defines field specifications by integrating with dbt’s metric layer. We’ll likely see them anchor on the term “metric layer” due to their close partnership with dbt, but it covers the same functional benefits as a semantic layer.
- ThoughtSpot’s use of LLMs for analytics, emulating “natural search”, means it doesn’t require users to define field specifications, but it does offer a way for users to provide feedback on the accuracy of a response to a question. Over time, your ThoughtSpot instance should learn from feedback and other user-intent signals which column(s) your business metrics apply to.
- Sigma Computing’s spreadsheet-like interface offers users both the ability to query raw data (i.e. BI fields map 1:1 with warehouse table fields) and a way for the data team to provide safeguards for business teams to play within.
- Finally, maybe it’s never really defined because your day-to-day involves primarily working with CSVs that the finance team ships you, queried through a tool like Count.
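As promised above, here’s a minimal, tool-agnostic sketch of what a field specification amounts to: dimensions and measures are declared once in code, and the BI layer compiles a requested metric into SQL rather than letting each chart redefine it. The table, field, and metric names are hypothetical and not tied to any of the vendors above.

```python
from dataclasses import dataclass

@dataclass
class Measure:
    name: str
    sql: str          # aggregation expression over warehouse columns

@dataclass
class SemanticModel:
    table: str
    dimensions: list  # warehouse columns users can group by
    measures: dict    # metric name -> Measure

# Hypothetical model: "revenue" is defined once, not per-dashboard.
orders = SemanticModel(
    table="analytics.orders",
    dimensions=["order_date", "region"],
    measures={"revenue": Measure("revenue", "SUM(amount)"),
              "order_count": Measure("order_count", "COUNT(*)")},
)

def compile_query(model: SemanticModel, metric: str, group_by: str) -> str:
    """Turn a metric request into SQL using the shared field specifications."""
    if metric not in model.measures or group_by not in model.dimensions:
        raise ValueError("Unknown metric or dimension")
    m = model.measures[metric]
    return (f"SELECT {group_by}, {m.sql} AS {m.name} "
            f"FROM {model.table} GROUP BY {group_by}")

print(compile_query(orders, "revenue", "region"))
# SELECT region, SUM(amount) AS revenue FROM analytics.orders GROUP BY region
```

Each of the tools above answers the same question differently: some put this definition in the platform, some pull it from dbt, and some infer it from user behavior.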
No predictions on which method definitively wins - the real answer depends on what your teams’ working styles are like. Many of these tools are also so new that you should consider playing with multiple to solidify your understanding of how they each approach semantic/metric layers.
Follow up notes
The types of trending terms that you’ll see are heavily dependent on a few factors: region, industry, company size, data strategy maturity, and even who you follow on LinkedIn.
It’s very very likely that you'll see different trends! For example, if we take a look at the revenue split of the customer base of companies pushing the use of “Fabric”, it skews heavily towards large, Fortune 500 companies. Having said that, please please call me out with any disagreements, inaccuracies, or other topics that you want my thoughts on by emailing me directly at brandon@metaplane.dev!
No matter which of these terms you choose to use and/or adopt, Metaplane helps teams everywhere trust their data. Reach out to learn how other customers have implemented these terms in their data strategy or directly create your free account.