Tuesday, June 16 2026

We’re teaching agents the core skills of data science: Methodology and statistical thinking


By Yann Debray, CPO of Probabl &
Gaël Varoquaux, CSO of Probabl & Research Director at Inria

TL;DR

Agents have already transformed software engineering, and the transformation of data science is on the horizon. Data scientists are already coding faster thanks to AI coding assistants, one prompt at a time, but with agentic data science we’ll be able to oversee agents that execute loops of iterations through our workflows and experiments. Today, however, agents excel at only one of the three core skills of data science: engineering. They lack the other two, methodology and statistical thinking, which distinguish data science as a practice from engineering. If we want to see the productivity uplift of agentic data science and trust the outcomes, we need to teach agents the core skills of data science. That’s precisely what we’re doing at Probabl.

Agents transformed engineering, but what about data science?

We owe Andrej Karpathy, a founding member of OpenAI, credit for two terms that we hear or use almost every day. In February 2025, Andrej coined “vibe coding” [1]: you prompt your AI coding assistant to generate code for you and, as he put it, “fully give in to the vibes” of the LLM. Vibe coding works one iteration at a time: you prompt your coding assistant, you get output.

In February 2026, he proclaimed that “agentic engineering” [2] was his favorite term for describing a new way of using AI in software engineering: orchestrating agents that do 99% of the coding while you provide the oversight.

In April, at Sequoia Ascent 2026, Andrej distinguished between these two related but distinct concepts [3]:

“1. Vibe coding raises the floor. It lets almost anyone create software by describing what they want.

2. Agentic engineering raises the ceiling. It is the professional discipline of coordinating fallible agents while preserving correctness, security, taste, and maintainability.

Vibe coding is fine for prototypes and personal tools. Agentic engineering is what serious teams need.”

While the agentic transformations have been happening in software engineering, there’s been much discussion in the data science community about what the agentic wave means for the practice of data science.

For example, this is how IBM describes the anticipated shift to agentic data science [4]:

“For the past two decades, the role of a data scientist has been about mastering data pipelines, algorithms, and tools to transform raw data into insights. But with the rise of agentic AI systems, frameworks where autonomous agents collaborate, reason, and take actions, data science is undergoing a radical shift. The Agentic Data Scientist isn’t just a human role anymore; it’s becoming a hybrid of human expertise and LLM-powered agents that can plan, code, execute, and self-improve data pipelines.”

The reality today is that we, data scientists, are already weaving AI into our workflows. We’re coding faster thanks to AI coding assistants, one prompt at a time. The vision for agentic data science is that we’ll be able to oversee agents that execute loops of iterations through our workflows and experiments. To borrow Andrej’s framing: while AI coding tools raise the floor (specifically, for coding), agents promise to raise the ceiling for the practice of data science.

The gap: Data science agents need to know more than coding

We agree about the transformational opportunity of agentic data science. So why hasn’t it taken off yet the way agentic engineering has?

The short answer: data science is not software engineering. Data science demands three core skills: engineering, methodology, and statistical thinking. Agents are strong at coding. Methodology and statistical thinking are where they fall short and, as we’ve written on our blog before [5], where the value of the data scientist has always been concentrated.

“Success” means different things in software engineering and in data science. In software engineering, a program that runs is largely a program that works, given a bit of wrestling to build a good test suite. In data science, success is contextual rather than binary. Decisions must be made and judgment is required across data, features, models, pipelines, and reports. And these decisions can produce different outcomes. This was shown, for example, by a study of 29 teams involving 61 analysts who used the same data set to address the same research question: whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players [6]. Unsurprisingly, they got different results.

In software engineering, bugs announce themselves with errors and failing tests. In data science, that safety net mostly disappears. Agents can produce entire machine learning pipelines in seconds. A pipeline can execute flawlessly and render clean plots. But without the right methodology or statistical thinking, the results can be misleading. Our colleague Arturo showed this in a recent demo. An agent produced perfectly functional code that ran PCA on raw one-hot-encoded (OHE) categorical variables. The pipeline executed. The plots rendered. In the process, infrequent categories were silently dropped. But these infrequent categories could be useful for the task at hand, and dropping them could introduce bias and fairness issues. The underlying error is that PCA is unsupervised: it keeps the directions of highest variance and knows nothing about the task at hand.
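To make the mechanism concrete, here is a minimal sketch of one way this can happen. It is not Arturo's demo: the synthetic column, the category frequencies, and the choice of two components are all assumptions made for illustration. It one-hot encodes a single categorical column, keeps the top two principal components, and measures how much of each category's indicator column survives the round trip.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column where "rare" covers ~2% of rows.
rng = np.random.default_rng(0)
labels = np.array(["A", "B", "C", "rare"])
column = rng.choice(labels, size=5000, p=[0.50, 0.30, 0.18, 0.02]).reshape(-1, 1)

# One-hot encode; columns follow the sorted category order: A, B, C, rare.
encoder = OneHotEncoder()
X = encoder.fit_transform(column).toarray()
categories = [str(c) for c in encoder.categories_[0]]

# Keep only the two highest-variance directions (an assumed setting),
# then map back to the original one-hot space.
pca = PCA(n_components=2)
X_back = pca.inverse_transform(pca.fit_transform(X))

# Fraction of each indicator column's variance that survives the round trip.
residual = ((X - X_back) ** 2).mean(axis=0)
retained = 1.0 - residual / X.var(axis=0)
retained_by_category = {c: float(r) for c, r in zip(categories, retained)}

for name, fraction in retained_by_category.items():
    print(f"{name:>5}: {fraction:.2f} of variance retained")
```

Nothing in this code raises an error or a warning. A rare indicator column simply has little variance, so a variance-based projection has little reason to keep it, whether or not the category matters for the downstream task.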

Errors cluster where there is no clean metric to tell the agent whether it got things right. Data leakage is the clearest case: scaling before the train/test split, target leakage, resampling before splitting. For example, an agent optimizing for accuracy may not be neutral about leakage. In fact, it could be drawn toward it because leakage can inflate the very score the agent is optimizing for. Catching errors like this means questioning a number that looks good, and that requires statistical reasoning and domain expertise.
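The sketch below illustrates how leakage can inflate a score. It swaps in feature selection as the preprocessing step rather than the scaling or resampling cases named above, because the effect is easy to see on a small example; the data is pure noise with labels that carry no signal, and all sizes and settings are assumptions. The only difference between the two scores is whether the preprocessing step sees the held-out rows.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic noise: there is nothing to learn, so honest accuracy should sit near 0.5.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 5000
X = rng.normal(size=(n_samples, n_features))
y = np.tile([0, 1], n_samples // 2)  # balanced labels, independent of X

# Leaky: features are selected using ALL rows, including those later
# used for testing inside cross-validation.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Honest: selection lives inside the pipeline, so it is refit on the
# training rows of each fold only.
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_acc = cross_val_score(honest_model, X, y, cv=5).mean()

print(f"leaky accuracy:  {leaky_acc:.2f}")
print(f"honest accuracy: {honest_acc:.2f}")
```

Both variants run without complaint, and the leaky one reports the better number. An agent that is rewarded only on the score has no reason to prefer the honest pipeline.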

The same holds at the other end of the workflow: a score like 0.91 AUC says nothing on its own about which business decisions should change or what the model is worth in practice. Translating that number into real-world impact takes human judgment and domain expertise.

Hugo Bowne-Anderson made a similar point in a recent episode of his podcast Vanishing Gradients [7]. He suggests agentic data science hasn’t yet taken off because of the risk of “vibe scientists”, who can’t verify that what models produce is real and significant. Eric Ma put it similarly on his blog [8]: “our core mission remains rigorous measurement, not full-stack development. While AI tools make building easier, the real value comes from defining and evaluating what truly matters.”

In essence, if we want to see the productivity uplift of agentic data science and trust the outcomes, we need to teach agents to excel at all three core skills of data science.

We’re teaching agents the core skills of data science

This is precisely the gap our field has spent twenty years learning to close in human practitioners: knowing when the data is misleading, knowing what a model can and cannot generalize to, knowing which method the question actually calls for.

When scikit-learn was built on top of Python and the NumPy/SciPy stack, the goal was to give machine learning an organization, a grammar. Because it enables black-box model application and evaluation, the fit/predict contract took a sprawling, heterogeneous research field and made it composable: any estimator, any pipeline, the same two verbs. It is hard to overstate how much of today’s practice rests on that simplicity, which is why we’d argue scikit-learn is one of the most influential, and sometimes underappreciated, tools in AI and data science. Millions of practitioners learned to think in scikit-learn before they learned to think in anything else.
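As a small illustration of that contract, the sketch below runs three quite different models through the same loop. The synthetic dataset and the particular estimators are arbitrary choices for the example; the point is only that a linear model, a tree ensemble, and a pipeline all respond to the same two calls.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, used only to have something to fit.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three different kinds of object, one shared interface.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "scaled_logistic": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                # verb one
    predictions[name] = model.predict(X_test)  # verb two
    print(name, "accuracy:", (predictions[name] == y_test).mean())
```

Because the pipeline is itself an estimator, it drops into the loop unchanged, which is what makes evaluation code reusable across models.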

Seen this way, scikit-learn did for machine learning what harnesses do for LLMs: it channels data science practice toward scientifically sound methodology. And the lineage continues. skrub does for messy data pipelines what scikit-learn did for estimators, bringing principled structure to the wrangling that consumes most of a data scientist’s time. skore does the same for experimentation: evaluation, comparison, and reporting with guardrails built in. And tabicl is the fully open-source tabular foundation model that complies with the scikit-learn API and achieves state-of-the-art (SOTA) predictive performance straight out of the box.

Now that agents are doing data science, they need a harness, too: encoded best practices – call them skills – that steer these immensely capable tools toward statistically sound methodology, the way fit/predict has steered a generation of practitioners.

That is the project. Not replacing the data scientist – but equipping the agent with what data scientists already know, and freeing the data scientist to do what the agent cannot.

We would rather build that future than predict it.

The work starts at probabl.ai/skills.