News
The Featured Post for this month is on AI and Data Engineering.
Data engineering and AI are deeply intertwined. AI relies heavily on data, and data engineering provides the infrastructure and pipelines that make that data accessible, clean, and usable for AI models. In turn, AI is starting to automate and enhance data engineering tasks. The relationship is increasingly symbiotic: data engineers are the backbone of AI, while AI is becoming a powerful tool that helps data engineers work more efficiently and unlock new possibilities.
- Throughout the week, we will be adding articles, images, livestreams, and videos about the latest data engineering developments to this post (select the News tab).
- You can also participate in discussions in all Data Engineering onAir posts as well as share your top news items and posts (for onAir members – it’s free to join).
Anthropic: How we built our multi-agent research system
Anthropic describes how Claude’s Research feature uses a multi-agent system that distributes research tasks across specialized subagents via an orchestrator-worker pattern. The architecture boosts performance by parallelizing exploration and token usage, with key insights into prompt engineering (delegation, scaling, tool design), evaluation (LLM-as-judge, human-in-the-loop), and production hardening (stateful runs, debugging, orchestration).
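The orchestrator-worker pattern described above can be sketched roughly as follows. This is an illustrative toy, not Anthropic's actual system: `plan` and `research_subtopic` stand in for LLM calls so the control flow is runnable.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(question: str) -> list[str]:
    # The orchestrator agent would decompose the question into subtasks;
    # here we return fixed facets to keep the sketch self-contained.
    return [f"{question}: background",
            f"{question}: recent work",
            f"{question}: open problems"]

def research_subtopic(subtask: str) -> str:
    # A worker subagent would search and synthesize; we return a stub.
    return f"findings for '{subtask}'"

def run_research(question: str) -> str:
    subtasks = plan(question)
    # Subagents explore in parallel, trading extra tokens for lower latency.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        findings = list(pool.map(research_subtopic, subtasks))
    # The orchestrator merges worker findings into one report.
    return "\n".join(findings)

print(run_research("vector databases"))
```

The key design choice the article highlights is exactly this split: a lead agent that delegates, and stateless workers that can fan out in parallel.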
LinkedIn: Introducing Northguard and Xinfra: scalable log storage at LinkedIn
LinkedIn unveils Northguard, a Kafka replacement built to handle over 32 trillion daily records by addressing scalability, operability, and durability challenges at hyperscale. Northguard introduces a sharded log architecture with minimal global state, decentralized coordination (via SWIM), and log striping for balanced load, backed by a pluggable storage engine using WALs, Direct I/O, and RocksDB. LinkedIn also developed Xinfra, a virtualized Pub/Sub layer with dual-write and staged topic migration, to enable seamless migration and zero-downtime interoperability between Kafka and Northguard.
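The striping idea can be illustrated with a small sketch. This is a simplified illustration of the general technique, with hypothetical names, not Northguard's actual implementation: rather than pinning a whole log to one broker (as a Kafka partition is), consecutive segments cycle across a set of storage nodes so load spreads evenly.

```python
# Map each log segment to a storage node within a stripe set.
# Hypothetical names; illustrates round-robin striping only.

def stripe_segments(num_segments: int, nodes: list[str],
                    stripe_width: int) -> dict[int, str]:
    """Assign segment indices to nodes, cycling through the stripe set."""
    stripe_set = nodes[:stripe_width]  # subset of nodes holding this log
    return {seg: stripe_set[seg % stripe_width]
            for seg in range(num_segments)}

placement = stripe_segments(6, ["node-a", "node-b", "node-c", "node-d"],
                            stripe_width=3)
# Segments cycle through node-a, node-b, node-c; node-d holds none.
```

With striping, a single hot log no longer saturates one broker's disk, since its segments land on several nodes.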
Canva: Measuring Commercial Impact at Scale at Canva
Canva writes about its internal app “IMPACT,” a Streamlit-on-Snowflake app that automates measurement of business metrics like MAU and ARR across 1,800+ annual experiments. Built with Snowpark, Cortex, and the Snowflake Python connector, the app replaces manual, error-prone analysis with a self-serve interface that aligns with finance models, supports pre/post-experiment workflows, and stores results for downstream use. Its modular architecture and PR-driven dev workflow enable scalable collaboration, while natural language summaries and scheduled metric calculations streamline impact analysis from hours to minutes.
Artificial intelligence is transforming the entire pipeline from college to the workforce: from tests and grades to job applications and entry-level work.
This is a hard time to be a young person looking for a job. The unemployment rate for recent college graduates has spiked to recession levels, while the overall jobless rate remains quite low. By some measures, the labor market for recent grads hasn’t been this relatively weak in many decades. What I’ve called the “new grad gap”—that is, the difference in unemployment between recent grads and the overall economy—just set a modern record.
In a recent article, I offered several theories for why unemployment might be narrowly affecting the youngest workers. The most conventional explanation is that the labor market gradually weakened after the Federal Reserve raised interest rates. White-collar companies that expanded like crazy during the pandemic years have slowed down hiring, and this snapback has made it harder for many young people to grab that first rung on the career ladder.
But another explanation is too tantalizing to ignore: What if it’s AI? Tools like ChatGPT aren’t yet super-intelligent. But they are super nimble at reading, synthesizing, looking stuff up, and producing reports—precisely the sort of things that twentysomethings do right out of school. As I wrote:
The Conversation – June 30, 2025
Reenvisioning the college major
Even if colleges keep the requirement that students complete a major to earn a degree, they can allow students to bundle smaller modules – such as variable-credit minors, certificates or course sequences – into a customizable, modular major.
This lets students, guided by advisers, assemble a degree that fits their interests and goals while drawing from multiple disciplines. A few project-based courses can tie everything together and provide context.
Such a model wouldn’t undermine existing majors where demand is strong. For others, where demand for the major is declining, a flexible structure would strengthen enrollment, preserve faculty expertise rather than eliminate it, attract a growing number of nontraditional students who bring to campus previously earned credentials, and address the financial bottom line by rightsizing curriculum in alignment with student demand.
Andreas Kretz – June 11, 2025
In this podcast episode, I’m joined by Simon Späti, long-time BI and data engineering expert turned full-time technical writer and author of the living book Data Engineering Design Patterns.
We talk about:
➡️ His 20-year journey from SQL-heavy BI to modern Data Engineering
➡️ Why switching from employee to full-time author wasn’t planned, but necessary
➡️ How he uses a “Second Brain” system to manage and publish his knowledge
➡️ Why writing is a tool for learning — not just sharing
➡️ The concept of convergent evolution in data tooling: when old and new solve the same problem
➡️ The underrated power of data modeling and pattern recognition in a hype-driven industry

Simon also shares practical advice for building your own public knowledge base, and why Markdown and simplicity still win in the long run. Whether you’re into tools, systems, or lifelong learning, this one’s a thoughtful deep dive.
Uber: The Evolution of Uber’s Search Platform
Shopify: Introducing Roast – Structured AI Workflows Made Easy
Sem Sinchenko: Why Apache Spark is often considered as slow
Meta: Collective Wisdom of Models: Advanced Feature Importance Techniques at Meta
https://roundup.getdbt.com/p/from-docker-to-dagger-w-solomon-hykes
In this season of the Analytics Engineering podcast, Tristan is digging deep into the world of developer tools and databases. There are few more widely used developer tools than Docker. From its launch back in 2013, Docker has completely changed how developers ship applications.
In this episode, Tristan talks to Solomon Hykes, the founder and creator of Docker. They trace Docker’s rise from startup obscurity to becoming foundational infrastructure in modern software development. Solomon explains the technical underpinnings of containerization, the pivotal shift from platform-as-a-service to open-source engine, and why Docker’s developer experience was so revolutionary.
The conversation also dives into his next venture, Dagger, and how it aims to solve the messy, overlooked workflows of software delivery. Bonus: Solomon shares how AI agents are reshaping how CI/CD gets done and why the next revolution in DevOps might already be here.
Data Engineering Central – June 23, 2025
We all love our little language wars: the vitriol, the expletives, the dank memes that accompany them.
You may recall, as your friendly neighborhood Anonymous Rust Dev, that I’ve been magnanimous and kind to the Python language. This is because I’m above it all – I can find the beauty in any language if I look hard enough, and I’m willing to give credit where it’s due.
But even I have my breaking point, which I’d like to explore today. How clunky does a language need to be to make itself my foe?
SeattleDataGuy’s Newsletter – June 19, 2025
1) Avoid Being The Only Engineer On A Team Early On In Your Career
2) Hype Won’t Fix Your Business Problems
3) Solving Business Problems Is Hard
4) Don’t Skip The Fundamentals
5) Don’t Be Above Putting in the Work
6) Ask Why – But Don’t Always Expect A Great Answer
7) Learn to Communicate with Non-Data People
What’s Next? Predictions for the Future of Search and SEO
- Visibility Shifts from Keywords to Topical Coverage
- Personalization Becomes Standard
- Paid Placements and Monetization in AI Mode
- The Agency Divide: Traditional vs. Modern Approaches
- New Analytics and Search Console Features
- Final Thought: Search Is Now a Conversation
As AI Mode evolves from an experiment to a default experience, search will feel less like a transaction and more like an ongoing conversation. Success will go to brands that can be present, relevant, and trusted at every step—no matter how the questions change.
To grow and thrive in the rapidly evolving AI landscape, organizations must strategically invest in their data engineering capabilities.
In today’s digital landscape, businesses generate massive volumes of data daily that can be processed, analyzed, and interpreted for future scalability and growth. This is where AI-driven systems become integral across industries, powering real-time analytics, forecasting, and AI-driven automation. Beverly D’Souza, a Data Engineer at Patreon (previously at Meta), has played a key role in improving data workflows, speeding up data processing, and launching machine learning models. Drawing on her experience with ETL pipelines, cloud data systems, and AI analytics, she shared, “Building scalable AI-powered data pipelines comes with key challenges, and to overcome these obstacles, organizations must implement distributed computing frameworks that can handle large-scale data processing efficiently. Incorporating AI-driven automation helps streamline data processing tasks, making the entire system faster and more efficient.”
The Efficacy of Standardized Interfaces in Reducing Manual Intervention, Improving System Resilience, and Enabling Advanced Governance in Enterprise Data Platforms
Future Outlook
Agentic AI, MCP, and A2A are poised to reshape enterprise data platforms in the next 3–5 years. We will likely see middleware for agents become mainstream: platform vendors already integrate these protocols. Systems will evolve from static ETL pipelines into adaptive, self-optimizing networks. For instance, future data lakes might auto-tune storage tiers based on usage patterns discovered by agents, or data warehouses could self-partition hot tables. As Microsoft’s announcements highlight, AI is moving toward an “active digital workforce” [33]: think LLMs that don’t just suggest queries, but execute workflows end-to-end. With A2A, agents from different vendors and clouds will interoperate, breaking current silos. Enterprises will embed AI into governance: agentic systems continuously audit for compliance.
We may also see advances in model capabilities driving agentic efficiency, e.g., hybrid systems where a symbolic planner guides LLMs, or LLMs with built-in code execution (like Azure’s CUA) making some MCP calls redundant. Standards (MCP, A2A) will likely expand; Google’s A2A is already a collaboration with 50+ partners [40], promising broader interoperability. In short, the data platform of the future could sense, reason, and act: as data patterns shift, agents reconfigure pipelines; when costs spike, agents throttle resources; when new regulations arrive, agents update data handling policies. This vision of an adaptive, self-driving data platform is on the horizon thanks to agentic AI and these new protocols.
Bastille Post Global – June 9, 2025
Amperity, the AI-powered customer data cloud, today launched Chuck Data, the first AI Agent built specifically for customer data engineering. Chuck uses Amperity’s years of experience and patented identity resolution models, trained on billions of data sets across 400+ enterprise brands, as critical knowledge behind the AI. Chuck runs in the terminal and empowers engineers to quickly understand their data, tag it, and resolve customer identities in minutes – all from within their Databricks lakehouse.
As pressure mounts to deliver business-ready insights quickly, data engineers are hitting a wall: while infrastructure has modernized, the work of preparing customer data still relies on manual code and brittle rules-based systems. Chuck changes that by enabling data engineers to “vibe code” – using natural language prompts to delegate complex engineering tasks to an AI assistant.
This session explores how AI agents are transforming data engineering by automating complex workflows such as data ingestion, transformation, and pipeline orchestration. With real-time analytics and intelligent decision-making becoming critical, AI-driven automation is enabling greater efficiency, scalability, and accuracy in data processes.
From automated anomaly detection to schema evolution and performance optimization, discover practical use cases that showcase the power of AI in simplifying and future-proofing data engineering strategies. Ideal for data engineers, architects, and AI enthusiasts, this talk offers insights into leveraging AI agents to reduce operational overhead and stay ahead in the era of intelligent automation.
Data Engineering Digest brings together the best content for Data Engineering Professionals from the widest variety of industry thought leaders. It is a combined effort of the Data Science Council of America and Aggregage. The goals of the site and newsletter are to:
Collect High Quality Content – The goal of a content community is to provide a high quality destination that highlights the most recent and best content from top quality sources as defined by the community.
Provide an Easy to Navigate Site and Newsletter – Our subscribers are often professionals who are not regular readers of the blogs and other sources. They come to the content community to find information on particular topics of interest to them. This links them across to the sources themselves.
Be a Jump Off Point – To be clear, all our sites/newsletters are only jump-off points to the sources of the content.
Help Surface Content that Might Not be Found – It’s often hard to find and understand blog content that’s spread across sites. Most of our audience are not regular subscribers to these blogs and other content sources.