The importance of data audits when building AI

Artificial intelligence can do a lot to improve business practices, but AI algorithms can also introduce new avenues of risk. For example, consider Zillow’s recent shutdown of Offers, the branch of the company dedicated to buying fixer-uppers, after its prediction models significantly overshot house values. When housing price data changed unpredictably, the group’s machine-learning models didn’t adapt quickly enough to account for the volatility, resulting in significant losses. This type of mismatch between the data a model was trained on and the data it sees in production, often called “concept drift,” goes unnoticed if you don’t give proper care and respect to data audits.
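
As an illustration of what one audit check for this kind of drift could look like, here is a minimal sketch that compares a recent sample of one numeric feature against the sample the model was trained on, using the population stability index (PSI). This is not a description of Zillow's process; the function name, the bin count, and the 0.25 alert threshold (a common rule of thumb) are all assumptions made for the example.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    recent: Sequence[float],
    n_bins: int = 10,
) -> float:
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread to bin")
    width = (hi - lo) / n_bins

    def bin_fractions(values: Sequence[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_frac = bin_fractions(reference)
    new_frac = bin_fractions(recent)
    return sum((n - r) * math.log(n / r) for r, n in zip(ref_frac, new_frac))


# Hypothetical prices (in thousands): the recent sample has shifted upward.
training_prices = [300.0 + i for i in range(100)]
recent_prices = [p + 60.0 for p in training_prices]

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff, tune for your data
if population_stability_index(training_prices, recent_prices) > ALERT_THRESHOLD:
    print("Feature distribution has shifted; review before trusting predictions.")
```

A check like this only covers one feature at a time and says nothing about why the data moved, so it is a trigger for a closer look rather than a replacement for one.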

Zillow’s failure to properly audit its data didn’t just hurt the company; it could have caused wider damage by scaring other businesses away from AI. Negative perceptions of a technology can halt its progress in the commercial world, especially for a field like AI that has already been through several winters. Machine-learning pioneers like Andrew Ng recognize what hangs in the balance and have started campaigns to emphasize the importance of data audits, for example by holding an annual competition for the best data quality assurance methods (instead of picking winners based only on the model, as has traditionally been done).

Beyond my own work to build AI, as host of The Robot Brains podcast, I’ve also interviewed dozens of AI practitioners and researchers about their approach to auditing and maintaining high-quality data. Here are some of the best practices I’ve compiled from that work:

  • Beware of outsourcing your data curation and labeling. Data maintenance isn’t the sexiest task, and it’s time-intensive. When time is short, as it is for most entrepreneurs, it’s tempting to outsource the responsibility. But beware of the risks that come with it. A third-party vendor won’t be as intimately familiar with your product vision, won’t know the contextual nuances, and won’t have the personal incentive to keep as tight a rein on the data as the job requires. Andrej Karpathy, head of AI for Tesla, says that he spends 50% of his own time on maintaining the vehicles’ data playbooks because it’s that important.
  • If your data is incomplete, address the gaps. All is not lost if your data sources reveal gaps or potential areas for erroneous prediction. One source that’s often problematic is demographic data. Historical demographic data sources tend to skew towards white males (medical history data, for example, has traditionally contained a higher percentage of information about Caucasian males), and that can bias your entire model. Princeton professor and co-founder of AI4All, Olga Russakovsky, created the REVISE tool, which brings to light patterns of correlations (possibly spurious) in visual data. You can use it to request insensitivity to these patterns or decide to collect more data that doesn’t have the patterns. (Here is the code to run it if you want to use it.) Demographic data is the most often cited example, but the same approach can be applied in any scenario.
  • Understand the implications of sacrificing intelligence for speed. Your data audit may motivate you to plug in larger data sets with more complete coverage. In theory, that might seem like a great strategy, but it can actually be a mismatch for the business goal at hand. The larger the data set, the slower the analysis. Is that extra time justified by the value of the increased insight?

    Financial services companies have had to ask themselves this question quite often, given the massive dollar amounts at play and the industry’s technology getting faster and faster (think nanoseconds). Mike Schuster, head of AI at financial services firm Two Sigma, shared that it is important to keep in mind that a more precise model, driven by more data, can often result in longer inference times during deployment, possibly not meeting your need for speed. Conversely, if you make longer-horizon decisions, you’ll be competing with others in the market who incorporate much larger amounts of data, so you will have to do the same to stay competitive.
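
To make the gap check from the second practice above a little more concrete, here is a minimal sketch that compares how often each group appears in a data set against the share you would expect it to have, and flags groups that fall well short. It is unrelated to REVISE; the function name, the group labels, the expected shares, and the 50% tolerance are all hypothetical choices for the example.

```python
from collections import Counter
from typing import Hashable, Iterable, Mapping


def coverage_gaps(
    labels: Iterable[Hashable],
    expected_share: Mapping[Hashable, float],
    tolerance: float = 0.5,
) -> dict[Hashable, tuple[float, float]]:
    """Flag groups whose observed share is below tolerance * expected share.

    Returns a mapping of group -> (observed_share, expected_share).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to audit")
    gaps = {}
    for group, expected in expected_share.items():
        observed = counts.get(group, 0) / total
        if observed < tolerance * expected:
            gaps[group] = (observed, expected)
    return gaps


# Hypothetical audit: one group label per record, plus assumed target shares.
record_groups = ["A"] * 75 + ["B"] * 20 + ["C"] * 5
target_shares = {"A": 0.5, "B": 0.3, "C": 0.2}

for group, (observed, expected) in coverage_gaps(record_groups, target_shares).items():
    print(f"Group {group}: {observed:.0%} of records, expected about {expected:.0%}")
```

Where the target shares come from (census figures, your customer base, the population the model will serve) is a judgment call the audit itself has to make; the code only reports the shortfall.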

Applying AI models to solve business problems is becoming common as the open-source community makes them freely available to all. The downside is that as AI-generated insights and predictions become the status quo, the less flashy work of data maintenance can get overlooked. It’s like building a house on sand: it may look fine initially, but as time passes, the structure will collapse.

Professor Pieter Abbeel is Director of the Berkeley Robot Learning Lab and Co-Director of the Berkeley Artificial Intelligence Research (BAIR) Lab. He has founded three companies: Covariant (AI for intelligent automation of warehouses and factories), Gradescope (AI to help teachers with grading homework and exams), and Berkeley Open Arms (low-cost 7-DoF robot arms). He also hosts the podcast The Robot Brains.

Originally appeared on: TheSpuzz