With the massive growth of machine learning (ML)-backed services, the term MLops has become a regular part of the conversation — and with good reason. Short for “machine learning operations,” MLops refers to a broad set of tools, work functions and best practices to ensure that machine learning models are deployed and maintained in production reliably and efficiently. Its practice is core to production-grade models — ensuring quick deployment, facilitating experiments for improved performance and avoiding model bias or loss in prediction quality. Without it, ML becomes impossible at scale.
As with any up-and-coming practice, it’s easy to be confused about what MLops actually entails. To help, we’ve listed seven common myths about MLops to avoid, so you can get on track to use ML successfully at scale.
Myth #1: MLops ends at launch
Reality: Launching an ML model is just one step in a continuous process.
ML is an inherently experimental practice. Even after initial launch, it’s necessary to test new hypotheses while fine-tuning signals and parameters. This allows the model to improve in accuracy and performance over time. MLops processes help engineers manage the experimentation process effectively.
For example, a core component of MLops is version management. This allows teams to track key metrics across a wide set of model variants to ensure the optimal one is selected, while allowing for easy reversion in the event of an error.
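As a rough illustration of version management, the sketch below keeps metrics for several model variants, promotes the best one and reverts to the previous version on request. The ModelRegistry class, its method names and the AUC figures are hypothetical and chosen only for this example; production registries also persist model artifacts and metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    version: str
    metrics: Dict[str, float]


@dataclass
class ModelRegistry:
    """Toy in-memory registry (hypothetical); real ones persist artifacts too."""
    versions: List[ModelVersion] = field(default_factory=list)
    history: List[str] = field(default_factory=list)  # promotion order, newest last

    def register(self, version: str, metrics: Dict[str, float]) -> None:
        self.versions.append(ModelVersion(version, metrics))

    def best(self, metric: str) -> ModelVersion:
        # Assumes higher is better for the chosen metric.
        return max(self.versions, key=lambda v: v.metrics[metric])

    def promote(self, version: str) -> None:
        if all(v.version != version for v in self.versions):
            raise KeyError(f"unknown version: {version}")
        self.history.append(version)

    @property
    def live(self) -> Optional[str]:
        return self.history[-1] if self.history else None

    def rollback(self) -> Optional[str]:
        # Drop the current live version and fall back to the one before it.
        if len(self.history) > 1:
            self.history.pop()
        return self.live


registry = ModelRegistry()
registry.register("v1", {"auc": 0.81})  # made-up metric values
registry.register("v2", {"auc": 0.86})
registry.register("v3", {"auc": 0.84})

registry.promote("v1")
registry.promote(registry.best("auc").version)  # promotes v2
```

Calling registry.rollback() after a bad release would make "v1" the live version again without retraining anything.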
It’s also important to monitor model performance over time because of the risk of data drift. Data drift occurs when the data a model sees in production shifts substantially from the data it was originally trained on, leading to poor-quality predictions. For example, many ML models trained on pre-pandemic consumer behavior degraded severely in quality after the COVID-19 lockdowns changed the way we live. MLops addresses these scenarios by creating strong monitoring practices and by building infrastructure to adapt quickly if a major change occurs. It goes far beyond launching a model.
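One common way to put a number on drift for a single numeric feature is the population stability index (PSI), which compares how training and production values fall into the same bins. The sketch below is a minimal pure-Python version; the bin count, the synthetic data and the 0.25 alert threshold (a widely quoted rule of thumb, not something this article prescribes) are all assumptions.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population stability index between a training and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in values:
            # Clamp out-of-range production values into the edge bins.
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic data: production values shifted well away from training values.
train = [i / 100 for i in range(1000)]
prod_stable = list(train)
prod_shifted = [x + 5 for x in train]

DRIFT_THRESHOLD = 0.25  # assumed rule-of-thumb alert level
drift_detected = psi(train, prod_shifted) > DRIFT_THRESHOLD
```

In a monitoring job, a check like this would run on a schedule per feature and raise an alert or trigger retraining when the threshold is crossed.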
Myth #2: MLops is the same as model development
Reality: MLops is the bridge between model development and the successful use of ML in production.
The process used to develop a model in a test environment is typically not the same one that will enable it to be successful in production. Running models in production requires robust data pipelines to source and process data and to train models, often on much larger datasets than those found in development.
Databases and computing power will typically need to move to distributed environments to manage the increased load. Much of this process needs to be automated to ensure reliable deployments and the ability to iterate quickly at scale. Tracking must also be far more robust, because production environments will see data outside of what is available in test, so the potential for the unexpected is far greater. MLops comprises all of these practices for taking a model from development to launch.
Myth #3: MLops is the same as devops
Reality: MLops works towards similar goals as devops, but its implementation differs in several ways.
While both MLops and devops strive to make deployment scalable and efficient, achieving this goal for ML systems requires a new set of practices. MLops places a stronger emphasis on experimentation than devops does. Unlike standard software, ML models are often deployed as many variants at once, so model monitoring is needed to compare them and select the best-performing version. For each redeployment, it’s not sufficient just to land the code; the models need to be retrained every time there is a change. This differs from standard devops deployments, as the pipeline now must include a retraining and validation phase.
For many of the common practices of devops, MLops extends the scope to address its specific needs. Continuous integration for MLops goes beyond testing code to include data quality checks and model validation. Continuous deployment covers more than shipping a set of software packages; it also includes a pipeline to modify or roll back changes in models.
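To make the extended continuous integration step concrete, here is a hedged sketch of a gate that fails a build when the training data has too many missing values or when a candidate model scores worse than the current baseline. The function names, the 5% null-rate limit and the 0.01 tolerance are illustrative assumptions, not a standard.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def check_data_quality(rows: List[Row], required: List[str],
                       max_null_rate: float = 0.05) -> List[str]:
    """Return a list of problems; an empty list means the batch passes."""
    if not rows:
        return ["dataset is empty"]
    problems = []
    for col in required:
        nulls = sum(1 for r in rows if r.get(col) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            problems.append(f"{col}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    return problems


def validate_model(candidate_score: float, baseline_score: float,
                   tolerance: float = 0.01) -> bool:
    """Pass only if the candidate is no worse than the baseline minus a tolerance."""
    return candidate_score >= baseline_score - tolerance


def ci_gate(rows: List[Row], required: List[str],
            candidate_score: float, baseline_score: float) -> bool:
    # Both the data check and the model check must pass before deployment.
    problems = check_data_quality(rows, required)
    return not problems and validate_model(candidate_score, baseline_score)


# Hypothetical batches: one clean, one where half the "age" values are missing.
good_rows = [{"age": 30.0, "income": 50000.0} for _ in range(20)]
bad_rows = good_rows[:10] + [{"age": None, "income": 1.0} for _ in range(10)]
```

A CI job could call ci_gate alongside the usual unit tests and block the merge or deployment when it returns False.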
Myth #4: Fixing an error is just changing lines of code
Reality: Fixing ML model errors in production requires advance planning and multiple fallbacks.
If a new deployment leads to a degradation in performance or some other error, MLops teams need to have a suite of options on hand to resolve the issue. Simply reverting to the previous code is often not sufficient, given that models need to be retrained before deployment. Instead, teams should keep multiple versions of models on hand, to ensure there is always a production-ready version available in case of an error.
Moreover, in scenarios where there is a loss of data, or a significant shift in the production data distribution, teams need to have simple fallback heuristics so that the system can at least keep up some level of performance. All of this requires significant prior planning, which is a core aspect of MLops.
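A fallback heuristic can be as simple as wrapping the model call so that missing inputs or a model failure produce a safe default instead of an outage. The sketch below assumes a hypothetical toy_model and uses a fixed historical average as the fallback value; both are placeholders for whatever a real system would use.

```python
from typing import Callable, Dict, List, Optional, Tuple

Features = Dict[str, Optional[float]]


def predict_with_fallback(
    model: Callable[[Features], float],
    features: Features,
    required: List[str],
    fallback_value: float,
) -> Tuple[float, str]:
    """Return (prediction, source), using the fallback when inputs or the model fail."""
    if any(features.get(name) is None for name in required):
        return fallback_value, "fallback:missing_features"
    try:
        return model(features), "model"
    except Exception:
        # Any model failure degrades to the heuristic instead of an error.
        return fallback_value, "fallback:model_error"


def toy_model(f: Features) -> float:
    # Hypothetical stand-in for a trained model.
    return 0.5 * f["recent_purchases"]


def broken_model(f: Features) -> float:
    raise RuntimeError("model server unavailable")


HISTORICAL_AVERAGE = 1.2  # assumed precomputed heuristic value
REQUIRED = ["recent_purchases"]

ok = predict_with_fallback(toy_model, {"recent_purchases": 4.0}, REQUIRED, HISTORICAL_AVERAGE)
missing = predict_with_fallback(toy_model, {"recent_purchases": None}, REQUIRED, HISTORICAL_AVERAGE)
failed = predict_with_fallback(broken_model, {"recent_purchases": 4.0}, REQUIRED, HISTORICAL_AVERAGE)
```

Returning the source alongside the prediction lets monitoring track how often the system is running on the heuristic rather than the model.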
Myth #5: Governance is fully distinct from MLops
Reality: While governance has distinct goals from MLops, much of MLops can help support governance objectives.
Model governance manages the regulatory compliance and risk associated with ML system use. This includes things like maintaining appropriate user data protection policies and avoiding bias or discriminatory outcomes in model predictions. While MLops is typically seen as ensuring that models are delivering performance, this is a narrow view of what it can deliver.
Tracking and monitoring of models in production can be supplemented with analysis to improve the explainability of models and find bias in results. Transparency into model training and deployment pipelines can facilitate data processing compliance goals. MLops should be seen as a practice to enable scalable ML for all business objectives, including performance, governance and model risk management.
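As one example of how production monitoring can feed governance, logged predictions can be grouped by a protected attribute to compare positive-prediction rates. The sketch below computes that gap, often called the demographic parity difference. It is only one of several fairness measures, and the group labels and records are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]  # (group label, binary prediction)


def positive_rates(records: Iterable[Record]) -> Dict[str, float]:
    """Share of positive predictions per group."""
    totals: Dict[str, int] = defaultdict(int)
    positives: Dict[str, int] = defaultdict(int)
    for group, pred in records:
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def parity_gap(records: Iterable[Record]) -> float:
    """Largest difference in positive-prediction rate between any two groups."""
    rates = positive_rates(records)
    return max(rates.values()) - min(rates.values())


# Invented prediction log: group "a" is approved 60% of the time, group "b" 30%.
logged: List[Record] = (
    [("a", 1)] * 6 + [("a", 0)] * 4 + [("b", 1)] * 3 + [("b", 0)] * 7
)
gap = parity_gap(logged)
```

A large gap does not prove discrimination on its own, but it is the kind of signal a governance review would want surfaced from the same monitoring pipeline.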
Myth #6: Managing ML systems can be done in silos
Reality: Successful MLops systems require collaborative teams with hybrid skill sets.
ML model deployment spans many roles, including data scientists, data engineers, ML engineers and devops engineers. Without collaboration and an understanding of each other’s work, ML systems can become unwieldy at scale.
For instance, a data scientist may develop models without much external visibility or input, which can then lead to challenges in deployment due to performance and scaling issues. Likewise, a devops team without insight into key ML practices may not develop the tracking needed to enable iterative model experimentation.
This is why, across the board, it’s important that all team members have a broad understanding of the model development pipeline and ML practices — with collaboration starting from day one.
Myth #7: Managing ML systems is risky and untenable
Reality: Any team can leverage ML at scale with the right tools and practices.
Because MLops is still a growing field, it can seem dauntingly complex. However, the ecosystem is maturing rapidly, and a wide range of resources and tools is available to help teams succeed at each step of the MLops lifecycle.
With the proper processes in place, you can unlock the full potential of ML at scale.
Krishnaram Kenthapadi is the chief scientist at Fiddler AI.