Microsoft open-sources SynapseML for developing AI pipelines

November 18, 2021

1857 Views 0

SaveSavedRemoved 0

Hear from CIOs, CTOs, and other C-level and senior execs on data and AI strategies at the Future of Work Summit this January 12, 2022. Learn more

Microsoft today announced the release of SynapseML (previously MMLSpark), an open source library designed to simplify the creation of machine learning pipelines. With SynapseML, developers can build “scalable and intelligent” systems for solving challenges across domains, including text analytics, translation, and speech processing, Microsoft says.

“Over the past five years, we have worked to improve and stabilize the SynapseML library for production workloads. Developers who use Azure Synapse Analytics will be pleased to learn that SynapseML is now generally available on this service with enterprise support [on Azure Synapse Analytics],” Microsoft software engineer Mark Hamilton wrote in a blog post.

Scaling up AI

Building machine learning pipelines can be difficult even for the most seasoned developer. For starters, composing tools from different ecosystems requires considerable code, and many frameworks aren’t designed with server clusters in mind.

Despite this, there’s increasing pressure on data science teams to get more machine learning models into use. While AI adoption and analytics continue to rise, an estimated 87% of data science projects never make it to production. According to Algorithmia’s recent survey, 22% of companies take between one and three months to deploy a model so it can deliver business value, while 18% take over three months.

SynapseML aims to address the challenge by unifying existing machine learning frameworks and Microsoft-developed algorithms in an API, usable across Python, R, Scala, and Java. SynapseML enables developers to combine frameworks for use cases that require more than one framework, such as search engine creation, while training and evaluating models on resizable clusters of computers.

As Microsoft explains on the project’s website, SynapseML expands Apache Spark, the open source engine for large-scale data processing, in several new directions. “[The tools in SynapseML] allow users to craft powerful and highly-scalable models that span multiple [machine learning] ecosystems. SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models and use their Spark clusters for massive networking workflows.”

SynapseML also enables developers to use models from different machine learning ecosystems through the Open Neural Network Exchange (ONNX), a framework and runtime co-developed by Microsoft and Facebook. With the integration, developers can execute a variety of classical and machine learning models with only a few lines of code.

Beyond this, SynapseML introduces new algorithms for personalized recommendation and contextual bandit reinforcement learning using the Vowpal Wabbit framework, an open source machine learning system library originally developed at Yahoo! Research. In addition, the API features capabilities for “unsupervised responsible AI,” including tools for understanding dataset imbalance (e.g., whether “sensitive” dataset features like race or gender are over- or under-represented) without the need for labeled training data and explainability dashboards that explain why models make certain predictions — and how to improve the training datasets.

Where labeled datasets don’t exist, unsupervised learning — also known as self-supervised learning — can help to fill the gaps in domain knowledge. For example, Facebook’s recently announced SEER, an unsupervised model, trained on a billion images to achieve state-of-the-art results on a range of computer vision benchmarks. Unfortunately, unsupervised learning doesn’t eliminate the potential for bias or flaws in the system’s predictions. Some experts theorize that removing these biases might require a specialized training of unsupervised models with additional, smaller datasets curated to “unteach” biases.

“Our goal is to free developers from the hassle of worrying about the distributed implementation details and enable them to deploy them into a variety of databases, clusters, and languages without needing to change their code,” Hamilton continued.

Originally appeared on: TheSpuzz