By Cailean Osborne, Head of Ecosystem Development at Probabl
With PyTorch Conference Europe kicking off today, it’s a good moment to take stock of where open source tooling for data science and artificial intelligence (AI) stands, and where it’s headed. To do just that, I sat down with Gaël Varoquaux and Matt White and picked their brains about the trends shaping the ecosystem and the priorities close to their hearts.
Few people have a better read on where open source tooling for data science and AI is heading – or why it matters – than Gaël and Matt. Gaël is our CSO at Probabl, Research Director at Inria, and a creator and core maintainer of scikit-learn. Matt is the Global CTO of AI at the Linux Foundation and the CTO of both the PyTorch Foundation and the Agentic AI Foundation.
Among the many hats they wear, Gaël and Matt play key roles in scikit-learn and PyTorch — the most used Python libraries for machine learning (ML) and deep learning (DL) respectively. To put their dominance into numbers: as of today, according to publicly available download data, scikit-learn has been downloaded over 4.3 billion times and PyTorch over 1.6 billion times. And to visualize those numbers: the chart below shows that scikit-learn and PyTorch are truly in a league of their own when you compare their yearly downloads with popular alternatives for ML and DL.
In our conversation, we covered a lot of ground, including how open source tools like scikit-learn are democratizing ML for millions of daily users; enterprise demand for openness wherever AI intersects with strategic control; the opportunities at the intersection of open source ML tooling and agentic AI; and much more.
Whether you’re a data scientist, ML engineer, or just trying to make sense of where open source AI is heading, there’s something here for you.
Figure 1: Yearly downloads (PyPI and Conda) of Python libraries for ML and DL
Cailean Osborne: Gaël, let me start with you. You’ve been a key player in building open source tools for data science and ML for as long as I can remember – from scikit-learn (which, as of this morning, has been downloaded over 4.3 billion times!) to SOTA tabular foundation models like TabICLv2. For someone who’s new to this space, can you quickly bring us up to speed on the kinds of open source tools you’ve helped to build?
Gaël Varoquaux: I’ve worked on many projects around data science, ML, and AI. I think that the project that has had the most impact is not the one that has been seen as the most impressive. It is not pushing the bleeding edge of tabular AI, as with our tabular foundation models. Rather, it is democratizing classic tools – the tools that everything else is built on – via the scikit-learn project. What we’ve always tried to do, in scikit-learn and in all our tools, is to empower people working with data and to make ML easier. It’s about making sure that people can go from their data to the answer, or the impact, they are interested in as efficiently as possible.
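To make that “data to answer” pitch concrete: with scikit-learn, a first working model on a tabular dataset is a handful of lines. The sketch below is a minimal editorial illustration (not taken from the interview), using a dataset bundled with the library:

```python
# Minimal "data to answer" workflow with scikit-learn, on a bundled dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Load a small tabular classification dataset and hold out a test split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a strong default model for tabular data and check held-out accuracy.
clf = HistGradientBoostingClassifier().fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```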
Cailean Osborne: A couple of weeks ago at NVIDIA GTC, Jensen Huang argued that structured data is the ground truth of AI in enterprises. He even called the structured data ecosystem a $120 billion opportunity. I’m sure his statement wasn’t lost on you, given your involvement in developing and speeding up best-in-class open source tools for ML on structured data. So, I’m curious: which sectors do you think will benefit most from faster ML?
Gaël Varoquaux: It’s hard to tell: tabular data is really everywhere. Finance has always benefited a lot from improved predictions, but predictions are also important in industries that have to manage stock – for example, retail and manufacturing – as well as in healthcare. For an organization to reap the benefits of faster, better ML, what’s really important is to take a systematic approach to data: accumulating data on every aspect of the organization, inward and outward facing, and thinking critically about measurement, as measuring the outcomes of interest is often difficult.
Cailean Osborne: Looking ahead, which priorities are you most excited about for the future of the open source tools that you’re involved in building?
Gaël Varoquaux: On the scikit-learn front, one exciting development is callbacks, which will bring progress bars and early stopping. We also have more UX work underway, with increasingly rich displays in notebooks and VS Code, as well as acceleration on dedicated hardware – CPUs and GPUs. I’m also very excited about progress in the broader ecosystem. As we develop skrub’s data transformation pipelines, we are tuning increasingly complex assemblies; our next important milestones are caching and support for time-related transformations. And skore is increasingly helpful for keeping track of the numerous evaluations and experiments that a data science project involves.
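For readers who haven’t tried skrub: the sketch below shows the basic shape of a skrub data transformation pipeline feeding a scikit-learn model. The DataFrame and its columns are hypothetical stand-ins, and the in-progress features Gaël mentions (callbacks, caching, time-related transformations) are not shown:

```python
# A minimal sketch of a skrub + scikit-learn pipeline. The DataFrame `df`
# and its columns are hypothetical stand-ins for real operational data.
import pandas as pd
from skrub import TableVectorizer  # encodes a heterogeneous DataFrame as numeric features
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

df = pd.DataFrame({
    "city": ["Paris", "Berlin", "Paris", "Madrid", "Berlin", "Paris"],
    "date": pd.to_datetime(["2024-01-03", "2024-02-10", "2024-03-21",
                            "2024-04-02", "2024-05-15", "2024-06-01"]),
    "units_sold": [120, 80, 150, 95, 88, 160],
})
y = df.pop("units_sold")

# TableVectorizer picks sensible per-column encoders (categories, dates, text),
# so the raw DataFrame can be piped straight into a scikit-learn estimator.
model = make_pipeline(TableVectorizer(), HistGradientBoostingRegressor())
model.fit(df, y)
print(model.predict(df.head(2)))
```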
Cailean Osborne: Thanks, Gaël. Now, turning to you, Matt. You recently broadened your scope from Executive Director of the PyTorch Foundation to Global CTO of AI at the Linux Foundation, covering the full gamut of open source and open standards, from deep learning with PyTorch to agentic AI and emerging applications of AI. This gives you a unique bird’s-eye view of the global open source AI ecosystem of today and where it’s heading tomorrow. Could you talk us through the top three trends that we should be following?
Matt White: The first trend is the shift from models that simply generate content to models that can take action. We are moving from assistant behavior toward systems that can reason, plan, use tools, maintain state, and operate across multiple interfaces. What is especially interesting now is that more of this capability is being pulled into the model layer itself. In other words, some of the behaviors that previously lived primarily in orchestration frameworks are increasingly becoming native model capabilities.
The second trend is the rise of the personalized and domain-aware agent. Enterprises are no longer looking for a generic chatbot; they want systems that understand their workflows, policies, data boundaries, and operating context. That means we are seeing increasing demand for agents that can be grounded in enterprise knowledge, deployed with strong governance, and tailored to specific roles such as engineering, compliance, customer support, or scientific research. The real story is not just autonomy, but useful autonomy inside bounded, accountable environments.
The third trend is the growing competitiveness of smaller, specialized open models. The industry spent the last two years focused on scale, but we are now seeing a more nuanced market emerge: highly capable domain-specific models, and increasingly efficient open models, are becoming viable for many real production tasks. That matters because enterprises often do not need the biggest possible model; they need the most efficient, governable, and cost-effective model that performs well for their use case. I expect that trend to accelerate, especially in regulated sectors and edge deployments where locality, auditability, and cost control are not optional.
Taken together, those three shifts point to the next phase of AI adoption: not just larger models, but more operationally useful systems built from open components, open standards, and increasingly capable open models, the ingredients for composable intelligence. The Linux Foundation is also positioning AI as part of a wider open ecosystem spanning software, standards, and data, while highlighting the need for interoperable standards and the infrastructure needed to reduce friction and increase adoption of open agentic AI.
Cailean Osborne: Fascinating, thanks for sharing those trends. Picking up on your last point: the open AI ecosystem is getting more dynamic every day, and AI researchers, data scientists, and enterprises now have SOTA open source tools and open-weight models at their fingertips. I’m curious to hear your thoughts on where open source AI systems are now ahead of proprietary systems, and where they still lag.
Matt White: Open source is already ahead in several of the most important layers of the AI stack. The modern AI ecosystem runs on open infrastructure: Linux, Kubernetes, PyTorch, JAX, open data tooling, open observability, and increasingly open inference infrastructure as well. On the deployment side, open systems give organizations more control over performance, cost, portability, security posture, and compliance. They also give enterprises the ability to inspect what they are running, shape roadmaps through contribution, and avoid being locked into a single vendor’s commercial and technical decisions. The Linux Foundation itself describes this as the foundational layer for modern infrastructure, and PyTorch continues to sit at the center of much of the model development ecosystem.
Open models have also closed the gap significantly. A few years ago, the gap between leading proprietary models and leading open models could feel very large. Today, that gap is much narrower, and in some domains open models are already strong enough to be the rational enterprise choice. Recent open model families such as Qwen3, GLM-5, and MiniMax M2.7 demonstrate how far reasoning, tool use, long context, and model-size diversity have advanced in the open.
Where proprietary systems still tend to lead is in highly integrated end-to-end productization at the frontier: the largest training runs, the most expensive post-training and evaluation pipelines, tightly coupled proprietary data flywheels, and polished consumer-grade user experiences delivered as a single managed service. Closed vendors can move very quickly when they control the whole stack and can amortize massive infrastructure investments across a global commercial platform.
So, my view is this: proprietary systems may still lead at the extreme frontier of vertically integrated AI products, but open source leads in the foundations that matter most for long-term industry adoption, namely interoperability, transparency, flexibility, and ecosystem-wide innovation. And over time, those are the attributes that tend to reshape markets.
Cailean Osborne: Looking to the future, what’s your prognosis: what is the one thing you expect to continue, and the one thing you expect to change, by the end of 2026 in how enterprises use and engage with tools in the open AI ecosystem?
Matt White: What will continue is enterprise demand for openness wherever AI intersects with strategic control. If AI is becoming part of core operations, then organizations will keep pushing for more visibility into models, more portability across environments, more leverage over cost, and more confidence in governance. That is especially true in regulated industries, public sector contexts, and any environment where AI outputs need to be explainable, auditable, and operationally dependable.
What I expect to change by the end of 2026 is the role open source AI will play inside production workflows. Today, many enterprises still think in terms of “using a model.” By the end of 2026, more of them will be thinking in terms of “operating an AI system” composed of models, tools, retrieval, policy layers, evaluation pipelines, and agents. Open source engagement will therefore move up the value chain: from experimenting with individual open models to building governed, modular, production-grade AI architectures on open foundations.
I also expect open models to become more common as the default choice for a wider range of enterprise tasks, not only where regulation requires local control, but also where economics do. When open-weight models are good enough, fast enough, and easier to deploy in a controlled way, they become a very attractive option. The result will not be a world in which proprietary AI disappears. It will be a more hybrid world in which enterprises are much more deliberate about when they buy intelligence as a service and when they build on open components they can control.
Cailean Osborne: Last but not least, what do you see as the most exciting opportunities for innovation in open source tooling at the intersection of a) tabular ML/AI and b) open source software and open standards for agents?
Matt White: This is one of the most exciting intersections in AI right now because so much real enterprise value still lives in structured data. A great deal of operational decision-making in finance, insurance, healthcare, manufacturing, retail, and the public sector is still driven by tabular workflows. So the opportunity is not to replace tabular ML with agentic systems; it is to combine them.
I see three especially promising areas.
First, agents can make tabular ML dramatically more accessible by helping users move from raw business questions to reproducible analyses, feature engineering, model selection, and evaluation workflows. That can make high-quality data science more available across an organization without lowering standards.
Second, tabular models can become a critical decision layer inside agentic systems. Many enterprise actions should not be driven by a general-purpose LLM alone. They should be informed by specialized predictive models trained on structured operational data. In that architecture, the agent orchestrates, but the tabular model contributes calibrated decision support where it is strongest. Tabular foundation models (TFMs) are also emerging as powerful tools for time-series analysis, forecasting, and anomaly detection, among other tabular data use cases.
Third, open standards will become increasingly important at that boundary. As agents interact with data systems, model endpoints, evaluation services, and governance controls, the industry will need common ways to describe tools, permissions, lineage, policy, and outcomes. That is where open ecosystems can create durable value: not just through great models and frameworks, but through the shared standards that allow these systems to work together safely and reliably.
So to me, the big opportunity is this: bringing the rigor, maturity, and business relevance of tabular ML into the emerging agentic stack. That is how we get beyond demos and into trustworthy, high-impact enterprise AI.
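To ground Matt’s second area in code: below is a minimal, hypothetical sketch of a calibrated scikit-learn model exposed as a plain-Python tool function that an agent framework could register and call. The synthetic data, the feature names, and the predict_churn_risk function are all illustrative assumptions, not a reference implementation:

```python
# Hypothetical sketch: a calibrated tabular model as the decision layer an agent
# calls into. Data, feature names, and the tool function are illustrative only.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import HistGradientBoostingClassifier

# Stand-in operational data: 3 features, binary "churn" labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Calibrate the classifier so its probabilities are usable as decision inputs.
model = CalibratedClassifierCV(HistGradientBoostingClassifier(), cv=3).fit(X, y)

def predict_churn_risk(tenure: float, spend: float, tickets: float) -> dict:
    """Tool function an agent framework could register; returns a calibrated risk."""
    proba = model.predict_proba([[tenure, spend, tickets]])[0, 1]
    return {"churn_probability": round(float(proba), 3)}

# The agent orchestrates the workflow; the tabular model supplies the estimate.
print(predict_churn_risk(tenure=0.4, spend=-1.2, tickets=0.8))
```

The design point is the division of labor: the agent decides when to ask, and the tabular model answers with a calibrated probability it can act on.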
Matt White is the Global CTO of AI at the Linux Foundation and the CTO of the PyTorch Foundation and Agentic AI Foundation. He is a longtime AI technologist, researcher, and open source leader with more than 25 years of experience spanning AI, data, autonomous systems, and large-scale technology platforms. At the Linux Foundation, he works across the open AI ecosystem, from model development and infrastructure to open collaboration, governance, and emerging standards. He has also helped lead major industry and community efforts around open source AI and the broader future of interoperable and composable intelligent systems.