The great consolidation, Elon’s foray into SQL, and AI’s impact on day-to-day data engineering

Welcome to Overheard in data—a monthly roundup of news you can use from inside the data world.

February 25, 2025


The data world is, as you might expect, chronically online. It’s also pretty scattered. There are tons of great conversations happening across Reddit, Twitter/X, Substack, Medium, and yes, even LinkedIn. 

So, in the spirit of data engineering, here’s an attempt to take information from multiple sources, funnel it to a central location, and format it for you to read and enjoy. Welcome to Overheard in data, February ‘25 edition.

Elon Musk’s intro to data engineering

Elon Musk has his hand in a lot of different pots these days, and one of his most recent DOGE undertakings caught the attention of some data engineers.

Non-data folks might take this post at face value, but for us, it raises some questions.

  • Are all of the people collecting Social Security really just sitting in one single master database table?
  • Does that table really contain a single `IS_DEAD` column?
  • What did the SQL query actually look like?
  • What does the payout amount data look like for each age group? (A rough sketch of that kind of cut follows below.)
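
For fun, here's a purely hypothetical sketch of the kind of check those bullet points describe, written with Pandas rather than SQL so it's easy to run locally. The table, column names, and numbers are all assumptions for illustration, not the actual Social Security data.

```python
import pandas as pd

# Hypothetical "one big table" with an IS_DEAD-style flag, per the post's framing.
# Every value here is invented for illustration.
beneficiaries = pd.DataFrame({
    "birth_date": pd.to_datetime(["1870-01-01", "1950-06-15", "1990-03-02"]),
    "is_dead": [False, False, False],
    "monthly_payout": [0.0, 1850.0, 0.0],
})

beneficiaries["age"] = (pd.Timestamp.today() - beneficiaries["birth_date"]).dt.days // 365
beneficiaries["age_bucket"] = beneficiaries["age"] // 10 * 10

# The naive check: count "living" records per age bucket.
living = beneficiaries[~beneficiaries["is_dead"]]
naive_counts = living.groupby("age_bucket").size()

# A more context-aware cut: the same buckets with payout totals. A "living"
# 150-year-old who is paid $0 looks like a data-quality artifact, not fraud.
payouts_by_bucket = living.groupby("age_bucket")["monthly_payout"].agg(["size", "sum"])

print(naive_counts)
print(payouts_by_bucket)
```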

Of course, Elon isn't going to dive into the data engineering work that yielded these results, as it would fly over most people's heads. At the same time, calling this real data analysis is a major stretch, and it highlights why context and business logic are so important in the data world. Normally, when wonky data like this pops up, we don't run it up the flagpole as fact; instead, we dive in and find out what other factors are at play, which leads us nicely to our next point.

The great consolidation

dbt ignited a lively conversation when it acquired SDF Labs at the beginning of the year: not about the acquisition itself, but about the broader theme of acquisitions and consolidation in the data technology market.

The past decade saw a significant boom in the data technology market, but now, the data shows a different theme—consolidation.

Image from PitchBook's Q2 2024 Data Analytics report

Sure, this graph is exaggerated by the free money era of 2020-2021, but pair it with the increased number of acquisitions we’ve seen in the data world, and there is a definite theme of tool consolidation.

What does this mean for the data folks actually using the tools? Well, the jury's still out. An acquisition can give your data stack a major boost, as your existing tools might get a lot more capable, but we've also all seen very capable tools get neglected by the companies that acquire them.

AI’s impact on day-to-day data engineering work

This topic could probably be a standing theme in the Overheard in data column, as AI’s implications for data engineering are only going to grow. This month’s edition comes from a well-thought-out graph from Zach Morris Wilson on which aspects of data engineering work will be most impacted by LLMs, and which aspects will be least impacted.

I think Zach's analysis is spot on here. Already, AI can spin up SQL queries that used to take much more time to write. It can even help with some of the soft-skill aspects of the job, like answering business questions about data: stakeholders can simply feed the data to their AI tool of choice and get a pretty decent understanding of what's happening.

The aspects of the job that AI isn’t likely to disrupt any time soon, though, are the highly strategic elements. Understanding context, business logic, and strategy are the skills that data engineers will likely spend more of their time on in the coming years. They’ll also be the skills that set data engineers apart.

Painful platform migrations

People in any profession and discipline can attest to how painful platform migrations can be. However, data platform migrations are acutely painful. That sentiment was echoed in this Reddit thread from r/dataengineering and in the comments of my own LinkedIn post.

Here’s what usually happens when this kind of migration takes place:

  1. Executives pick the vendor based primarily on projected cost savings.
  2. Even if the migration is funded by "vendor credits," those credits run out long before completion.
  3. Reality hits: the true cost includes engineering time, delayed projects, and accumulated technical debt.
  4. The sunk cost fallacy keeps the migration going even when the math no longer makes sense.

When leadership is too far removed from the day-to-day reality of engineering, these are the kinds of results you get. A platform migration isn't just copying files; it's rewriting integrations, retraining teams, and maintaining two systems during the transition.

Too often, the cost of migration isn't factored into these decisions. Even if you manage to trim your monthly spend by a bit, the migration cost can push your ROI years down the line.
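
To make that concrete, here's a back-of-the-envelope sketch of the payback math. Every figure in it is a made-up placeholder; swap in your own estimates.

```python
# Rough payback math for a platform migration. All figures are hypothetical.
monthly_savings = 10_000      # projected reduction in vendor spend per month
engineering_hours = 4_000     # hours the migration actually consumes
hourly_cost = 120             # fully loaded engineering cost per hour
vendor_credits = 100_000      # credits that offset part of the work

migration_cost = engineering_hours * hourly_cost - vendor_credits
payback_months = migration_cost / monthly_savings

print(f"True migration cost: ${migration_cost:,.0f}")
print(f"Payback period: {payback_months:.0f} months (~{payback_months / 12:.1f} years)")
# And that still ignores delayed projects and accumulated technical debt.
```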

That's why technical leaders need to involve practitioners when making platform decisions. Otherwise, you'll end up losing money on unnecessary migrations, and potentially losing some of your top talent along with it.

Moving off Pandas—a migration that might be worth it?

After covering why tool migrations are so painful, let's talk about one that might actually be worth it.

If you've been working with data in Python, you've almost certainly built workflows around Pandas. It’s been a staple for data manipulation since 2008, but it’s not without its pain points (one might call them “Panda Pains”). Santiago Valdarrama recaps some of them for us in his LinkedIn post.

The issue with Pandas today? Its single-threaded architecture and memory inefficiency are showing their age, and modern data workflows demand speed that Pandas simply wasn't designed to deliver.

Despite these limitations, Pandas maintains its crown for one compelling reason Santiago highlights: "Nobody wants to learn a new API." Let's be honest—we're all reluctant to rewrite existing codebases, retrain team members, or risk introducing new bugs into production pipelines that are (mostly) working.

Santiago introduces FireDucks as a potential solution, promising not only better performance but also an easy migration process: simply change a single import line (`import fireducks.pandas as pd`) or use an import hook (`python -mfireducks.imhook yourfile.py`).
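
For anyone curious what that swap looks like in practice, here's a minimal sketch. The DataFrame and column names are made up; the only FireDucks-specific part is the import line.

```python
# The drop-in swap described above: only the import line changes.
# (Previously: import pandas as pd)
import fireducks.pandas as pd

df = pd.DataFrame({
    "region": ["us-east", "us-west", "us-east", "eu-west"],
    "revenue": [120.0, 95.5, 143.2, 87.9],
})

# Existing Pandas-style code keeps working against the same API.
summary = df.groupby("region")["revenue"].sum().reset_index()
print(summary)
```

Alternatively, leave the script untouched and run it through the import hook (`python -mfireducks.imhook yourfile.py`) from Santiago's post.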

This is a good counterpoint on when it actually makes sense to switch platforms: growing pain and frustration, combined with a lower barrier to switching, make it much easier to say yes to a change.

Did we miss anything? Find my LinkedIn post to comment on the other interesting topics you’ve seen in the data world this past month.
