The Linux Foundation, the nonprofit consortium that provides a vendor-neutral hub for open source projects, today announced that McKinsey’s QuantumBlack will donate Kedro, a machine learning pipeline tool, to the open source community. The Linux Foundation will maintain Kedro under Linux Foundation AI & Data (LF AI & Data), an umbrella organization founded in 2018 to bolster innovation in AI by supporting technical projects, developer communities, and companies.
“We’re excited to welcome the Kedro project into LF AI & Data. It addresses the many challenges that exist in creating machine learning products today and it is a fantastic complement to our portfolio of hosted technical projects,” Ibrahim Haddad, executive director of LF AI & Data, said. “We look forward to working with the community to grow the project’s footprint and to create new collaboration opportunities with our members, hosted projects and the larger open-source community.”
The importance of pipelines
A machine learning pipeline is a construct that orchestrates the flow of data into — and out of — a machine learning model. Pipelines encompass raw data, data processing, predictions, and variables that fine-tune the behavior of the model with the goal of codifying the workflow so that it can be shared across an organization.
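To make the idea concrete, here is a minimal sketch of a pipeline in plain Python. It is not Kedro's API; every name in it (`run_pipeline`, `STEPS`, `clean`, `fit_mean`, `predict`, and the `scale` parameter) is hypothetical and chosen only for illustration. It shows raw data, a processing step, a tunable variable, and predictions wired together as named steps so the whole workflow can be shared and rerun.

```python
from typing import Callable, Dict, List, Tuple

# Each step is (function, names of its inputs, name of its output).
# This layout is an assumption for illustration, not Kedro's API.
Step = Tuple[Callable[..., object], List[str], str]


def clean(raw: List[object]) -> List[float]:
    """Data processing: drop missing values from the raw data."""
    return [x for x in raw if x is not None]


def fit_mean(values: List[float], scale: float) -> float:
    """'Training': a toy model that is just a scaled mean of the data."""
    return scale * sum(values) / len(values)


def predict(values: List[float], model_mean: float) -> List[bool]:
    """Predictions: flag every value above the learned threshold."""
    return [v > model_mean for v in values]


def run_pipeline(steps: List[Step], catalog: Dict[str, object]) -> Dict[str, object]:
    """Run steps in order, reading inputs from and writing outputs to one dict."""
    data = dict(catalog)  # copy so the caller's inputs are left untouched
    for func, inputs, output in steps:
        data[output] = func(*(data[name] for name in inputs))
    return data


STEPS: List[Step] = [
    (clean, ["raw"], "cleaned"),
    (fit_mean, ["cleaned", "scale"], "model_mean"),
    (predict, ["cleaned", "model_mean"], "predictions"),
]

# "raw" is the raw data; "scale" is a variable that tunes the model's behavior.
result = run_pipeline(STEPS, {"raw": [1.0, None, 3.0, 5.0], "scale": 1.0})
print(result["predictions"])
```

Because each step declares its inputs and outputs by name, a step can be swapped or reused without touching the others, which is the kind of codified, shareable workflow the paragraph above describes.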
Many machine learning pipeline creation tools exist, but Kedro is relatively new to the scene. Launched in 2019 by McKinsey, it’s a framework written in Python that borrows concepts from software engineering and brings them to the data science world, laying the groundwork for taking a project from an idea to a finished product.
According to Yetunde Dada, product lead on Kedro, Kedro was developed to address the main shortcomings of one-off scripts and “glue-code” by focusing on creating maintainable, efficient data science code. By building in modularity, one of the aims was to inspire the creation of reusable analytics code and enhance team collaboration.
In the two-and-a-half years Kedro has been available on GitHub, its community and user base have grown to over 200,000 monthly downloads and more than 100 contributors. Telkomsel, Indonesia’s largest wireless network provider, uses Kedro as a standard across its data science organization.
“This is the only way [Kedro] can grow at this point, if it is improved by the best people around the world,” Dada said in a statement. “Our cross-disciplinary team of 15 people gets to own increased development and validation of Kedro with this milestone. It is also [a] significant mark of validation for Kedro as a de-facto industry tool, joining a collection of other cutting-edge open-source projects such as Kubernetes donated by Google, GraphQL by Facebook or MLFlow and Delta Lake by Databricks.”
Open source software has become ubiquitous in the enterprise, where it’s now used even in mission-critical settings. While the integrity of the software is in question — particularly in light of recent events — seventy-nine percent of companies expect that their use of open source software for emerging technologies will increase over the next two years, according to a 2021 Red Hat survey.
According to product manager Joel Schwarzmann, Kedro will continue to be the foundation of analytics projects within McKinsey after it is open-sourced. “The ideas and guardrails that exist in Kedro are a reflection of that experience and are designed to help developers avoid common pitfalls and follow best practices,” Schwarzmann said in a blog post.
A spokesperson added via email: “Kedro will be focused on pursuing a stable API, or 1.0 version, formal integrations with developer tools and cloud platforms and continued work on our experiment tracking functionality. We want our users also to have surety that it is easy to upgrade versions of Kedro and benefit from new features. At this moment, Kedro supports elementary integrations with different cloud providers, and we want to work with the cloud providers to create seamless integrations. Experiment tracking, a way for data scientists to keep track of data science experiments, has paved the way for users to find and promote production models. We will be extending this functionality with many more features according to user problems.”
Kedro joins another open source pipeline tool released by Microsoft in November: SynapseML. With SynapseML, as with Kedro, developers can build systems for solving challenges across domains including text analytics, translation, and speech processing.