Google releases differential privacy tools to commemorate Data Privacy Day

AI promises to transform entire industries, and indeed has already transformed some, from civic planning and health care to cybersecurity. But privacy remains an unsolved challenge. In a case that spotlighted the issue two years ago, Microsoft quietly removed a dataset of more than 10 million images of people after it came to light that some subjects were unaware they had been included in it.

One proposed partial solution to the problem of privacy in AI is differential privacy. Differential privacy involves injecting a small, carefully calibrated amount of random noise into data before feeding it into an AI system, thus making it difficult to extract the original data from the system. Someone seeing a differentially private AI system’s prediction can’t reliably tell whether a particular person’s information was used to develop the system.
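To make the noise-injection idea concrete, here is a minimal Python sketch of one classic technique, the Laplace mechanism, applied to a simple count. It is an illustration only and is not taken from Google's library: the function names `laplace_noise` and `private_count`, the sample records, and the choice of `epsilon` are all assumptions made for this example. A smaller `epsilon` means more noise and stronger privacy.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution (hypothetical helper)."""
    # The difference of two independent exponential samples with rate 1/scale
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # Assumption: each person appears in at most one record, so adding or
    # removing one person changes the count by at most 1 (sensitivity 1).
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()  # a production system would need a secure noise source
    patients = [{"age": age} for age in (25, 41, 67, 70, 33)]
    noisy = private_count(patients, lambda p: p["age"] >= 65, epsilon=1.0, rng=rng)
    print(f"Noisy count of patients aged 65 or over: {noisy:.2f}")
```

Because the result changes from run to run, an observer cannot be sure whether any one person's record was included, yet averages over many people remain useful. Naive floating-point noise like this has known weaknesses, which is one reason vetted libraries are preferred in practice.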

In an effort to make differential privacy tools more widely accessible, Google today announced the expansion of its existing differential privacy library to the Python programming language in partnership with OpenMined, an open source community focused on privacy-preserving technologies. The company also released a new differential privacy tool that it claims allows practitioners to visualize and better tune the parameters used to produce differentially private information, as well as a paper sharing techniques to scale differential privacy to large datasets.

Expanding differential privacy

Google’s announcement marks both a year since it began collaborating with OpenMined and Data Privacy Day, which commemorates the January 1981 signing of Convention 108, the first legally binding international treaty dealing with data protection. Google open-sourced its differential privacy library — which the company claims is used in core products like Google Maps — in September 2019, before the arrival of Google’s experimental module that tests the privacy of AI models.

“In 2019, we launched our open-sourced version of our foundational differential privacy library in C++, Java and Go. Our goal was to be transparent, and allow researchers to inspect our code. We received a tremendous amount of interest from developers who wanted to use the library in their own applications, including startups like Arkhn, which enabled different hospitals to learn from medical data in a privacy-preserving way, and developers in Australia that have accelerated scientific discovery through provably private data,” Google differential privacy product lead Miguel Guevara wrote in a blog post. “Since then, we have been working on various projects and new ways to make differential privacy more accessible and usable.”

Google says that its differential privacy library’s newfound support for Python has already enabled organizations to begin experimenting with novel use cases, such as showing a site’s most visited webpages on a per-country basis in “an aggregate and anonymized way.” As before, the library, which complements TensorFlow Privacy, Google’s differential privacy tool set for TensorFlow, can be used with data processing engines like Spark and Beam, potentially allowing more flexibility in deployment.
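As a rough illustration of the per-country use case, the following Python sketch ranks pages by noisy visit counts. It does not use Google's library or reproduce its API. The function `noisy_top_pages`, its parameters, and the sample data are hypothetical, and it rests on two loudly stated assumptions: the lists of countries and pages are public and fixed in advance, and each user is limited to a single counted visit.

```python
import random
from collections import Counter


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Zero-mean Laplace sample, built from two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_top_pages(visits, countries, pages, epsilon, k, rng):
    """Return {country: [(page, noisy_count), ...]} with the top-k pages.

    `visits` is an iterable of (user_id, country, page) tuples.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    country_set, page_set = set(countries), set(pages)
    seen_users = set()
    counts = Counter()
    for user_id, country, page in visits:
        if country not in country_set or page not in page_set:
            continue  # only the public, pre-agreed keys are reported
        if user_id in seen_users:
            continue  # contribution bounding: one counted visit per user
        seen_users.add(user_id)
        counts[(country, page)] += 1

    result = {}
    for country in countries:
        # Noise is added to every cell, including zero counts, so the output
        # does not reveal which pages were actually visited in a country.
        noisy = [
            (page, counts[(country, page)] + laplace_noise(1.0 / epsilon, rng))
            for page in pages
        ]
        noisy.sort(key=lambda item: item[1], reverse=True)
        result[country] = noisy[:k]
    return result


if __name__ == "__main__":
    sample_visits = [
        ("u1", "US", "/home"), ("u2", "US", "/home"), ("u3", "US", "/pricing"),
        ("u4", "DE", "/pricing"), ("u5", "DE", "/pricing"), ("u1", "US", "/pricing"),
    ]
    top = noisy_top_pages(sample_visits, ["US", "DE"], ["/home", "/pricing"],
                          epsilon=1.0, k=1, rng=random.Random())
    for country, ranked in top.items():
        print(country, ranked)
```

With so few visits the noise can easily reorder the ranking, which is the intended trade-off: the published list stays useful for pages with many visitors while revealing little about any one of them. Production libraries handle details this sketch skips, such as choosing which keys to release and tracking the total privacy budget.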

Growing support

Google is among several tech giants that have released differential privacy tools for AI in recent years. In May 2020, Microsoft debuted SmartNoise, which was developed in collaboration with researchers at Harvard. Not to be outdone, Meta (formerly Facebook) recently open-sourced a PyTorch library for differential privacy dubbed Opacus.

Studies underline the urgent need for techniques to conceal private data in the datasets used to train AI systems. Researchers have shown that even “anonymized” X-ray datasets can reveal patient identities, for example. And large language models like OpenAI’s GPT-3 are known to leak names, phone numbers, addresses, and more from their training datasets when fed certain prompts.

Originally appeared on: TheSpuzz