
The open source tools of enterprise data science: A conversation with Merel Theisen

Written by Marie Sacksick | Tuesday, March 31, 2026


By Marie Sacksick, Director of Market Intelligence at Probabl

Ask any enterprise data science team what slows them down, and you'll probably hear similar answers. Notebooks that work locally but fall apart in production. Pipeline code that only one person truly understands. New hires who spend their first weeks reverse-engineering what the team before them built. Weeks lost to exploring and cleaning input data. These aren't exotic edge cases – they are the norm.

This is where open source tools – from Skore by Probabl to Kedro by QuantumBlack – are making the difference for enterprise data science teams. This week, I sat down with Merel Theisen, Tech Lead of Kedro at QuantumBlack, to discuss how open source tools drive lasting value for enterprise data science teams.

Building for enterprise data science teams

At Probabl, we are driven by the conviction that the data science industry is failing enterprises – not for lack of compute or data, but for lack of rigor and structure. Most models never reach production. Reproducibility remains an aspiration. Knowledge walks out the door with every departing data scientist or engineer. And a generation of automated tools that promise magic instead deliver opacity, technical debt, and lock-in.

This conviction is inseparable from where we come from. Probabl was founded by the creators and maintainers of scikit-learn, the most downloaded Python library for machine learning. In March 2026 alone, scikit-learn was downloaded over 200 million times. We don’t observe the data science world from the outside. Our founders built the open source infrastructure it runs on, and keep doing so. And that history shapes what we do.

We’re building tools for the enterprises that truly want to own their data science – those that prize tools that build institutional knowledge rather than concentrating it in black boxes or third-party platforms. As our CEO François Méro wrote recently, our four guiding principles are: (1) science first, (2) composability, (3) reusability, and (4) transparency. 

We’re putting these principles into practice with Skore, available as an open source library and as an enterprise platform, empowering data science teams to collaborate, scale their practice, and increase the impact of their AI projects.

None of this is built in isolation, though. The tools from the broader open source ecosystem – and the vibrant communities that maintain them at the state of the art – are essential to how enterprises can own their data science.

Kedro, the Python framework for production-ready data pipelines, is an important piece of that puzzle. By giving teams a standardized project structure and a principled way to build those pipelines, it addresses many of the same structural problems we think about at Probabl every day: how to move from individual heroics to institutional practice, from one-off experiments to reproducible, auditable systems.


A conversation with Merel Theisen

To learn more about the design choices and vision guiding Kedro, I sat down with Merel Theisen, Tech Lead of Kedro and Principal Software Engineer at QuantumBlack. We discussed how Kedro is built and why, what a healthy open source data science ecosystem actually looks like in practice, and how tools like Kedro and Skore create value for enterprises.

Marie Sacksick: Merel, for someone coming into this with zero context: what is Kedro and what problems does it solve for enterprise data science teams?

Merel Theisen: Kedro is an open source Python framework hosted by the Linux Foundation. It brings software engineering best practices to data science and data engineering, giving teams a standardized way to build production-ready data pipelines. For enterprise teams specifically, it solves some very real pain points: inconsistent project structures across teams, code that works in notebooks but falls apart in production, and the difficulty of collaborating on pipeline code when everyone has their own way of doing things. Kedro gives you a common foundation. This way teams can focus on the actual data science rather than reinventing project scaffolding every time.
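To make that "common foundation" concrete, here is a toy sketch of the node-and-pipeline pattern that Kedro formalizes. This is deliberately plain Python, not Kedro's own API – in a real project you would declare nodes with `kedro.pipeline.node`, compose them with `kedro.pipeline.pipeline`, and let Kedro's data catalog resolve the named datasets – but the core idea is the same: pure functions wired together by named inputs and outputs.

```python
# Toy illustration (NOT Kedro's API): nodes are pure functions,
# and a pipeline wires named inputs to named outputs.

def clean(raw_rows):
    """Node 1: drop rows with missing values."""
    return [r for r in raw_rows if None not in r.values()]

def featurize(clean_rows):
    """Node 2: derive a simple feature from each row."""
    return [{**r, "ratio": r["spend"] / r["visits"]} for r in clean_rows]

# A 'pipeline' here is an ordered list of (function, input_name, output_name).
PIPELINE = [
    (clean, "raw_rows", "clean_rows"),
    (featurize, "clean_rows", "features"),
]

def run(pipeline, catalog):
    """Run each node, reading inputs from and writing outputs to the catalog."""
    for func, inp, out in pipeline:
        catalog[out] = func(catalog[inp])
    return catalog

catalog = run(PIPELINE, {
    "raw_rows": [
        {"spend": 100.0, "visits": 4},
        {"spend": None, "visits": 2},   # dropped by clean()
    ],
})
print(catalog["features"])  # → [{'spend': 100.0, 'visits': 4, 'ratio': 25.0}]
```

Because every project follows the same shape, anyone on the team can open an unfamiliar pipeline and know where the nodes, datasets, and configuration live.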

Marie Sacksick: Data scientists rarely use a single tool in a vacuum. You may stitch together Kedro for data pipelining, scikit-learn for training machine learning models, SHAP for interpretability, and MLflow for monitoring models once they’re in production. Can you give us a sneak peek into a time you’ve seen Kedro used with tools like scikit-learn to drive real-world impact – what was the problem, and what was the outcome?

Merel Theisen: One great example is a large Brazilian independent broker that had no formal data science practice when they started out. Their main challenge was a classic one: every data scientist built pipelines their own way, and the typical workflow meant shipping notebooks straight to production. They'd tried tools like MLflow, but adoption never stuck because of the coding overhead.

The team adopted Kedro and it clicked for them because it met them where they were. It gave them standardized project structure, encouraged good software engineering practices, and let them think about models as proper software artifacts rather than one-off notebook experiments.

What's interesting is what happened next. Once Kedro was in place as that foundational layer, adopting other MLOps tools became much easier. MLflow for experiment tracking, Great Expectations for data validation: these tools slotted in naturally because the team already had clean, structured pipelines to integrate them with.

Marie Sacksick: When you develop new features for Kedro, how much do you prioritize interoperability with tools in the wider Python data science ecosystem? And going one step further, what does a healthily integrated Python data science ecosystem look like to you?

Merel Theisen: Kedro is designed to be tool- and platform-agnostic, so it slots into existing data stacks easily. As a Python library, tools like pandas, scikit-learn, and LangChain work natively inside Kedro projects. We also offer hooks, plugins, and kedro-datasets, our community-driven data connectors, to extend functionality further. A healthy ecosystem, to me, is one where tools complement each other and users can leverage the best of each without friction.

Marie Sacksick: At Probabl, we recently launched Skore Hub, a platform that extends our open source library Skore and enables data science teams to easily track, explore, and share their data science workflows. What value do you see Kedro and Skore, when used together, creating for enterprise data science teams?

Merel Theisen: To me, Kedro and Skore address different but complementary stages of the data science workflow. Kedro provides the pipeline structure: how data flows, how code is organised, how projects scale. Skore, as I understand it, focuses on model development quality, such as evaluation reports, methodological diagnostics, and cross-validation insights. I think together they'd give enterprise teams both structured, reproducible pipelines and rigorous model evaluation with built-in best practices, which is exactly the combination needed to move from experimentation to production confidently.

Marie Sacksick: Open source thrives on collaboration, yet many enterprise users are consumers rather than upstream contributors. Could you give us a sneak peek into how you and your team have successfully encouraged others to move from just using Kedro to actually contributing to it? Based on your learnings, what is your go-to advice for enterprises that steward core Python libraries for data science and AI?

Merel Theisen: Before open-sourcing Kedro, we established strong internal standards around code quality and testing. The challenge was maintaining that bar without discouraging contributions. We invested in clear contribution guides, streamlined developer setup, and responsive PR reviews, as people shouldn't be left waiting. We also created tiered contribution paths: kedro-datasets is an easy entry point, and our experimental dataset tier lowers the bar further, letting contributors share ideas without needing to fully polish them. My advice: make contributing feel achievable, respond quickly, and offer varied entry points for different commitment levels.

Marie Sacksick: Last but not least, how would you pitch scikit-learn to CEOs who want to leverage the power of AI in their businesses?

Merel Theisen: I'd pitch scikit-learn as the most battle-tested ML library in the Python ecosystem. It's open source, widely adopted, and covers the vast majority of practical ML use cases. And naturally, it works seamlessly inside Kedro projects, so teams get structured pipelines with best-in-class ML tooling out of the box!
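For a flavour of why scikit-learn covers so many practical use cases out of the box, here is a minimal, self-contained example: chaining preprocessing and a model into a single estimator with one `fit`/`predict` interface. The bundled toy dataset and the hyperparameters are purely illustrative – this is the kind of model step that would live inside a single Kedro node.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A small binary-classification dataset that ships with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling and the classifier become one estimator: the pipeline
# exposes the same fit/score interface as any single model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

That uniform estimator interface is what makes scikit-learn models easy to slot into pipeline frameworks like Kedro and evaluation tools like Skore.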


About Merel Theisen

Merel Theisen is a Principal Software Engineer at QuantumBlack, where she is currently the tech lead of Kedro, an open source project hosted by the Linux Foundation. She has over ten years of experience in the software industry, most of it focused on backend product engineering. Merel is passionate about building products that solve real user problems, and cares deeply about creating robust, well-tested software that follows good engineering principles. She is also a strong advocate for open source software, and finds working with the community to be both inspiring and energizing.

