“Heads up. Conversations like this can be intense. Don’t forget the human behind the screen.”
Twitter’s dialog warning is the latest move in a long-running battle to make us more civil to one another online. More disturbing still is the fact that we train large-scale AI language models on data from often-toxic online conversations. No wonder we see that bias reflected back to us in machine-generated language. What if, as we build the metaverse, effectively the next version of the web, we used AI to filter toxic dialogue for good?
A Facetune for language?
Right now, researchers are doing a lot of work to tune the accuracy of AI language models. In multilingual translation models, for example, a human in the loop can make a huge difference. Human editors can check that cultural nuances are properly reflected in a translation and effectively train the algorithm to avoid similar mistakes in the future. Think of humans as a tune-up for our AI systems.
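One common shape this human-in-the-loop workflow takes is turning editor reviews into new training pairs. The sketch below is a minimal illustration, not any particular company's pipeline; the `TranslationReview` type and the sample sentences are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TranslationReview:
    source: str
    machine_output: str
    human_edit: Optional[str] = None  # None means the editor approved the output as-is

def finetune_pairs(reviews: List[TranslationReview]) -> List[Tuple[str, str]]:
    """Turn human-reviewed translations into (source, target) training pairs.
    Edited outputs teach the model its mistakes; approved outputs reinforce
    behavior that is already correct."""
    return [(r.source, r.human_edit or r.machine_output) for r in reviews]

# Hypothetical review queue: one correction, one approval.
reviews = [
    TranslationReview("Obrigado!", "Obliged!", human_edit="Thank you!"),
    TranslationReview("Bom dia", "Good morning"),
]
pairs = finetune_pairs(reviews)
```

In a real system, `pairs` would be fed back into a fine-tuning run so the next model version avoids the same cultural and idiomatic mistakes.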
If you imagine the metaverse as a sort of scaled-up SimCity, this type of AI translation could instantly make us all multilingual when we talk to one another. A borderless society could level the playing field for people (and their avatars) who speak less common languages and potentially promote more cross-cultural understanding. It could even open up new opportunities for international commerce.
There are serious ethical questions that come with using AI as a Facetune for language. Yes, we can introduce some control on the style of language, flag cases where models aren’t performing as expected, or even modify literal meaning. But how far is too far? How do we continue to foster diversity of opinion, while limiting abusive or offensive speech and behavior?
A framework for algorithmic fairness
One way to make language algorithms less biased is to train them on synthetic data in addition to data scraped from the open internet. Synthetic data can be generated from relatively small “real” datasets.
Synthetic datasets can be created to reflect the population of the real world (not just the people who speak the loudest on the internet). It’s relatively easy to spot where a dataset’s statistical properties are skewed, and thus where synthetic data could best be deployed.
All of this raises the question: Is virtual data going to be a critical part of making virtual worlds fair and equitable? Could our decisions in the metaverse even affect how we think about and speak to each other in the real world? If the endgame of these technological decisions is a more civil global discourse that helps us understand each other, synthetic data may be worth its algorithmic weight in gold.
Yet, however tempting it is to think we can press a button, improve behavior, and build a virtual world in an all-new image, this isn’t a matter technologists alone will decide. It’s unclear whether companies, governments, or individuals will control the rules governing fairness and behavioral norms in the metaverse. With so many conflicting interests in the mix, it would be wise to listen to leading tech experts and consumer advocates about how to proceed. It may be blue-sky thinking to assume all competing interests will form a consortium, but it is imperative that we create one so the discussion about unbiased language AI can begin now. Every year of inaction means dozens, if not hundreds, of metaverses would need to be retrofitted to meet any eventual standards. These questions about what a truly accessible virtual ecosystem means must be worked out before mass adoption of the metaverse, which will be here before we know it.
Vasco Pedro is a Co-Founder and CEO of AI-powered language operations platform Unbabel. He spent over a decade in academic research focused on language technologies and previously worked at Siemens and Google, where he helped develop technologies to further understand data computation and language.