How open-source data labeling technology can mitigate bias


Data labeling is one of the most fundamental aspects of machine learning. It is also an area where organizations often struggle, both to categorize data accurately and to reduce potential bias.

With data labeling technology, the data used to train a machine learning model is first analyzed and given a label that provides a category and a definition of what the data is actually about. While data labeling is a critical component of the machine learning process, it has recently also proven to be highly inconsistent, according to multiple studies. The need for accurate data labeling has fueled a bustling marketplace of data labeling vendors.

Among the most popular data labeling technologies is the open-source Label Studio, which is backed by San Francisco-based startup Heartex. The new Label Studio 1.6 update being released today will provide users with new features to help better analyze and label data inside of videos.

According to Michael Malyuk, cofounder and CEO of Heartex, the challenge for most companies with artificial intelligence (AI) is having good data to work with.



“We think about labeling as a broader category of dataset developments and Label Studio is a solution that ultimately enables you to do any sort of dataset development,” Malyuk said.

Defining data labeling categories is a challenge

While the 1.6 release of Label Studio has a video player capability as the primary new feature, Malyuk emphasized that the technology is useful for any type of data including text, audio, time series and video.

Among the biggest issues with any labeling approach for all types of data is actually defining the categories used for data labels.

“Some people can name things one way, some people can name things a different way, but they essentially mean the same thing,” Malyuk said.

He explained that Label Studio provides taxonomies for labels that users can choose from to describe a piece of data, be it a text, audio or image file. If two or more people in the same organization label the same data differently, the Label Studio system will identify the conflict so that it can be analyzed and remediated. Label Studio provides both a manual conflict resolution system and an automated approach.
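The conflict detection described above can be illustrated with a minimal sketch. This is not Label Studio's API or its actual resolution logic; the record layout, the function names and the majority-vote rule are all assumptions made for illustration. Ties are left unresolved so that they can go to a manual review step.

```python
from collections import Counter, defaultdict

# Hypothetical annotation records: (item_id, annotator, label).
annotations = [
    ("clip-1", "ana", "dog"),
    ("clip-1", "ben", "dog"),
    ("clip-2", "ana", "cat"),
    ("clip-2", "ben", "kitten"),
    ("clip-2", "cy", "cat"),
    ("clip-3", "ana", "bird"),
    ("clip-3", "ben", "plane"),
]


def find_conflicts(records):
    """Return {item_id: {annotator: label}} for items with more than one distinct label."""
    by_item = defaultdict(dict)
    for item_id, annotator, label in records:
        by_item[item_id][annotator] = label
    return {
        item: votes
        for item, votes in by_item.items()
        if len(set(votes.values())) > 1
    }


def resolve_by_majority(votes):
    """Assumed automated rule: pick the most common label, or None on a tie."""
    counts = Counter(votes.values()).most_common()
    top_label, top_count = counts[0]
    if len(counts) > 1 and counts[1][1] == top_count:
        return None  # tie: leave for manual resolution
    return top_label


conflicts = find_conflicts(annotations)
resolved = {item: resolve_by_majority(votes) for item, votes in conflicts.items()}

for item, label in resolved.items():
    status = label if label is not None else "needs manual review"
    print(f"{item}: {status}")
```

In this toy data, "clip-2" has a clear majority, while "clip-3" is split evenly and would fall through to a human reviewer.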

Vector database vs. data labeling?

The process of data labeling can often involve manual work, with humans assigning a label or validating that a label is accurate.

There are a number of approaches to automating the process. Startup Lightly AI, for example, is using a self-supervised machine learning model that can integrate with Label Studio. Other vendors use a vector database to convert data into numeric vectors, rather than using data labeling to identify data and its relationships.

Malyuk said that vector databases do have their uses and can be effective for doing tasks such as similarity searches. The problem, in his view, is that the vector approach isn’t as effective with unstructured data types such as audio and video. He noted that a vector database can make use of identification types for common objects.
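As a rough illustration of the similarity search mentioned above, the sketch below ranks stored vectors by cosine similarity to a query vector. The item names and the three-dimensional vectors are made up for the example; a real vector database would store much larger embeddings produced by a model and would use an indexed search rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; in practice these would come from a trained model.
embeddings = {
    "photo_of_dog": [0.9, 0.1, 0.0],
    "photo_of_wolf": [0.8, 0.2, 0.1],
    "photo_of_car": [0.0, 0.1, 0.9],
}


def most_similar(query, index, top_k=2):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


results = most_similar([1.0, 0.0, 0.0], embeddings)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The search returns items that sit close together in vector space without any item having been given a label.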

“As soon as you start deviating from that common knowledge to something that is a little bit different, it’s going to become very complicated without manual labeling,” Malyuk said.

How data labeling can identify and mitigate AI bias

Bias in AI is an ongoing challenge that many in the industry are trying to combat. At the root of machine learning is the actual data, and the way that data is labeled can potentially lead to bias as well. Bias can be intentional, and it can also be circumstantial.

“If you’re labeling a very subjective dataset in the morning before coffee and then again after coffee, you may get very different answers,” Malyuk said.

While it’s not always possible to make sure that data labeling is only done by people who are fully caffeinated, there are processes that can help. Malyuk said that, on the software side, Label Studio provides a way to build a process in which everyone contributes labels individually. The system then builds matrices that match people with each other and compare how they labeled the same items. It’s an approach that Malyuk said can potentially identify bias for a specific label.
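One way to picture those matrices is a pairwise agreement table between annotators, plus a per-label disagreement rate. The sketch below is an assumption about how such a comparison could be computed, not a description of Label Studio's internals; the data, the function names and the simple percent-agreement metric are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: labels[annotator][item_id] = label.
labels = {
    "ana": {"t1": "positive", "t2": "negative", "t3": "neutral", "t4": "positive"},
    "ben": {"t1": "positive", "t2": "negative", "t3": "negative", "t4": "positive"},
    "cy": {"t1": "negative", "t2": "negative", "t3": "neutral", "t4": "positive"},
}


def pairwise_agreement(labels):
    """Fraction of shared items on which each pair of annotators chose the same label."""
    matrix = {}
    for a, b in combinations(sorted(labels), 2):
        shared = labels[a].keys() & labels[b].keys()
        if not shared:
            continue  # the two annotators never labeled the same item
        agree = sum(labels[a][item] == labels[b][item] for item in shared)
        matrix[(a, b)] = agree / len(shared)
    return matrix


def disagreement_by_label(labels):
    """For each label, the share of pairwise comparisons involving it that disagreed."""
    totals = defaultdict(int)
    disagreements = defaultdict(int)
    for a, b in combinations(sorted(labels), 2):
        for item in labels[a].keys() & labels[b].keys():
            label_a, label_b = labels[a][item], labels[b][item]
            for label in {label_a, label_b}:
                totals[label] += 1
                if label_a != label_b:
                    disagreements[label] += 1
    return {label: disagreements[label] / totals[label] for label in totals}


agreement = pairwise_agreement(labels)
rates = disagreement_by_label(labels)

for pair, score in agreement.items():
    print(pair, f"{score:.2f}")
for label, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(label, f"{rate:.2f}")
```

A label with an unusually high disagreement rate is a candidate for a closer look, since it may be ambiguously defined or applied differently by different people. Percent agreement is used here for simplicity; it does not correct for agreement that would occur by chance.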

The open-source Label Studio technology is intended to be used by individuals and small groups, while the commercial project provides enterprise features for larger teams around security, collaboration and scalability.

“With open source, we focus on the user and we are trying to make the individual user’s life as easy as possible from a labeling perspective,” Malyuk said. “With the enterprise, we focus on the organization and whatever the business needs, there are.”

Originally appeared on: TheSpuzz