More often than not, when organizations deploy applications across hybrid and multicloud environments, they use the open-source Kubernetes container orchestration system.
Kubernetes itself helps to schedule and manage distributed virtual compute resources, but it isn’t optimized by default for any one particular type of workload. That’s where projects like Kubeflow come into play.
For organizations looking to run machine learning (ML) in the cloud, a group of companies including Google, Red Hat and Cisco helped to found the Kubeflow open-source project in 2017. It took three years for the effort to reach the Kubeflow 1.0 release in March 2020, as the project gathered more supporters and users. Over the last two years, the project has continued to evolve, adding more capabilities to support the growing demands of ML.
This week, the latest iteration of the open-source technology became generally available with the release of Kubeflow 1.6. The new release integrates security updates and enhanced capabilities for managing cluster serving runtimes for ML, as well as new ways to more easily specify different artificial intelligence (AI) models to deploy and run.
“Kubeflow is an open-source machine learning platform dedicated to data scientists who want to build and experiment with machine learning pipelines, or machine learning engineers who deploy systems to multiple development environments,” Andreea Munteanu, product manager for AI/ML, Canonical, told VentureBeat.
The challenges of using Kubernetes for ML
There is no shortage of potential challenges that organizations can face when trying to deploy ML workloads in the cloud with Kubernetes.
For Steven Huels, senior director, AI product management and strategy at Red Hat, the biggest issue isn’t necessarily the technology; it’s the process.
“The biggest challenges we see from users related to data science and machine learning is repeatability — namely, being able to manage the model lifecycle from experimentation to production in a repeatable way,” Huels said.
Huels noted that integrating the model experimentation environment with the serving and monitoring environment makes this consistency more achievable. Users can see value from their data science experiments, while pipelines make those workflows repeatable over time.
In June of this year, the Kubeflow Community Release Team issued a User Survey Review report that identified a number of key challenges for machine learning. Only 16% of respondents said that all of the ML models they worked on in 2021 were successfully deployed into production and delivered business value. The survey also found that it takes more than five iterations of a model before it makes it into production. On a positive note, 31% of respondents stated that the average life of a model in production was six months or more.
The user survey also identified data preprocessing as one of the most time-consuming aspects of ML.
What’s new in Kubeflow 1.6
Canonical’s Munteanu commented that the Kubeflow 1.6 update is taking specific steps to help address some of the issues that the user survey identified.
For example, she noted that Kubeflow 1.6 makes data processing more seamless and offers better tracking capabilities, with improvements to the metadata. Moreover, Munteanu added that the latest release brings improved tracking for trial logs as well, allowing for efficient debugging in case of data source failure.
To help more models become production-ready, Munteanu said that Kubeflow 1.6 supports population-based training (PBT), which accelerates model iteration and improves the likelihood that models will reach production readiness.
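For readers unfamiliar with the technique, the general idea behind population-based training can be sketched in a few lines of Python. This is a minimal illustration on a toy objective, not Kubeflow's implementation: the function names, the quadratic loss, the population size and the perturbation factors are all assumptions chosen for the example. Several models train in parallel with different learning rates, and at regular intervals the weakest members copy the state of the strongest ones and slightly perturb the copied learning rate.

```python
import random


def loss(theta):
    # Toy objective (an assumption for this sketch): minimum at theta = 3.0.
    return (theta - 3.0) ** 2


def train_step(theta, lr):
    # One gradient-descent step on the toy objective.
    grad = 2.0 * (theta - 3.0)
    return theta - lr * grad


def pbt(population_size=8, rounds=20, steps_per_round=5, seed=0):
    rng = random.Random(seed)
    # Each member holds a model parameter and its own learning rate.
    population = [
        {"theta": 0.0, "lr": 10 ** rng.uniform(-4, -1)}
        for _ in range(population_size)
    ]
    for _ in range(rounds):
        # Train every member for a few steps with its current learning rate.
        for member in population:
            for _ in range(steps_per_round):
                member["theta"] = train_step(member["theta"], member["lr"])
        # Rank members from best (lowest loss) to worst.
        population.sort(key=lambda m: loss(m["theta"]))
        quarter = max(1, population_size // 4)
        top, bottom = population[:quarter], population[-quarter:]
        for member in bottom:
            parent = rng.choice(top)
            # Exploit: copy the parameter of a top performer.
            member["theta"] = parent["theta"]
            # Explore: perturb the copied learning rate, capped at 0.5 so
            # that gradient descent on this toy objective stays stable.
            member["lr"] = min(0.5, parent["lr"] * rng.choice([0.8, 1.2]))
    return min(population, key=lambda m: loss(m["theta"]))


if __name__ == "__main__":
    best = pbt()
    print(best, loss(best["theta"]))
```

Because poorly performing configurations are replaced during training rather than after it finishes, the search for good hyperparameters and the training itself happen in a single run.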
The Message Passing Interface (MPI) operator component has also been enhanced, which can help make training on large volumes of data more efficient. Munteanu also noted that PyTorch elastic training enhancements make model training more effective and help ML engineers get started quickly.
What’s next for Kubeflow
Multiple vendors and services integrate Kubeflow. For example, Canonical has what it calls Charmed Kubeflow, which provides a packaged and automated approach to running Kubeflow using Ubuntu’s Juju framework. Red Hat integrates Kubeflow components into its OpenShift Data Science product.
The direction of the Kubeflow project isn’t driven by any one contributor or vendor.
“Kubeflow is an open-source project that is developed with the help of the community, so its direction is ultimately going to come out of discussions within the community and the Kubeflow project,” Munteanu said.
Munteanu commented that with Charmed Kubeflow, Canonical is focusing on security and on streamlining user onboarding. She said that Canonical is also looking to integrate the product with other AI/ML-specific apps that enable AI/ML projects to go to production and to scale.
“We see Kubeflow’s future as being an essential part of a wider, ecosystem-based solution that addresses AI/ML projects and solves a challenge that many companies do not have the resources to address currently,” Munteanu said.