What enterprises can learn about data infrastructure from Cruise driverless cars

Developing safe driverless car technology is a highly specialized, complex and multifaceted undertaking — I know this firsthand, having recently worked for one of the small number of companies active in the sector.

Despite that, there are many lessons that enterprises across industries can learn from the driverless car industry, especially companies moving to embrace generative AI. Not least among them: How to build a robust and secure data infrastructure to support their AI models, according to Mo Elshenawy, executive vice president (EVP) of engineering at Cruise, General Motors’ (GM) self-driving car subsidiary.

“Data is the lifeline, and you work backward from there,” Elshenawy told me during our fireside chat at the VentureBeat Transform 2023 conference on Wednesday. “You’re going to find different [data] consumers across your organizations. Who needs the data and in what format they need it, and for how long? How soon do they need the data? So that’s a very important aspect to think about.”

Elshenawy shared his view from under the hood at Cruise, which launched the first customer-facing driverless car service in a major city — San Francisco — in early 2022. Today, Cruise’s driverless Chevy Bolts are a common sight in the City by the Bay, operating 24/7, though rides are for now limited to those who have signed up for Cruise’s waiting list.

Cruise handles more data than most organizations in any sector, giving the company a unique vantage point on what works in terms of data infrastructure, data pipelines and stress testing.

“Any given month, our Cruise engineers would be sifting through some seven exabytes of data — equivalent to 150 million years of video streaming,” Elshenawy said.

As such, Cruise had to make sure its data infrastructure was not only robust enough to handle this incredible volume of data, but also smart enough to categorize it and make it easily accessible to the people in the company who needed it — all while maintaining safety-critical security.

With vehicles capturing massive volumes of sensor data in real time, Cruise had to architect a data infrastructure from scratch that could handle the immense scale. Key considerations included scalability, security, cost optimization, and tooling to help engineers effectively leverage the data. 
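Cruise has not published its internal designs, but the shape of those considerations is easy to sketch. The toy Python below illustrates two of them: partitioning raw sensor data so consumers read only the slices they need, and demoting aging partitions to cheaper storage tiers. All names, paths and thresholds are invented for illustration, not Cruise’s actual scheme.

```python
# A hedged sketch of two data-infrastructure ideas: partitioned layout for
# scalability, and age-based tiering for cost optimization. All names and
# thresholds are illustrative assumptions, not Cruise's real design.
from datetime import date, timedelta

def object_key(vehicle_id: str, day: date, sensor: str) -> str:
    """Partitioned key: readers prune by day/vehicle instead of scanning everything."""
    return f"raw/dt={day.isoformat()}/vehicle={vehicle_id}/{sensor}.bin"

def storage_tier(day: date, today: date) -> str:
    """Cost optimization: demote partitions as they age out of active use."""
    age = (today - day).days
    if age <= 30:
        return "hot"      # frequently read, e.g. by ML training jobs
    if age <= 365:
        return "cool"     # occasional replay and debugging
    return "archive"      # retained for audit, rarely touched

today = date(2023, 7, 12)
for days_ago in (1, 90, 500):
    d = today - timedelta(days=days_ago)
    print(object_key("av-001", d, "lidar_front"), "->", storage_tier(d, today))
```

The point of conventions like these is that each consumer Elshenawy mentions — training jobs, analysts, auditors — can be routed to exactly the data it needs, at a storage cost matched to how often it asks.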

From data lake to warehouse and lakehouse architecture

One of the most pressing questions facing any organization looking to use generative AI — or those dealing with any kind of software and digital data, in fact — is where and how to store all their data.

In the early days of personal computing and enterprise tech, digital “warehouses” were the answer. This meant putting structured data — organized, tabular data such as a spreadsheet or comma-separated-values file — into one system for storing and querying it all.

But as organizations began to collect and analyze more unstructured data — such as customer interactions, code, and multimedia content like photos, videos and audio — they needed another way to store it all, especially given the vast and rapidly growing quantities they were accumulating. Thus the data lake was born.

Finally, in the last few years, companies have moved to a hybrid storage and retrieval architecture: the lakehouse, which combines qualities of the warehouse and the lake, allowing both structured and unstructured data to be stored, managed and queried in the same system.
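To make the distinction concrete, here is a minimal Python sketch of the lakehouse idea, using pandas and DuckDB as stand-in tools (an assumption for illustration; nothing here reflects Cruise’s actual stack). Records land once, as open-format Parquet files in a “lake” of plain storage, and a SQL engine queries those same files in place, giving the warehouse experience without a second copy of the data.

```python
# Minimal lakehouse sketch: one copy of the data, warehouse-style access.
# Assumes pandas, pyarrow and duckdb are installed; all names are invented.
import os
import duckdb
import pandas as pd

os.makedirs("lake", exist_ok=True)  # the "lake" is just files in cheap storage

# Land a batch of hypothetical driving-event summaries in open Parquet format.
events = pd.DataFrame({
    "vehicle_id": ["av-001", "av-002", "av-001"],
    "event": ["hard_brake", "cone_detected", "lane_change"],
    "ts": pd.to_datetime(["2023-07-12 08:01", "2023-07-12 08:03", "2023-07-12 08:07"]),
})
events.to_parquet("lake/events.parquet")

# Warehouse-style SQL over the very same files -- no replication into a second tier.
print(duckdb.sql("""
    SELECT event, COUNT(*) AS n
    FROM 'lake/events.parquet'
    GROUP BY event
    ORDER BY n DESC
""").df())
```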

Elshenawy said Cruise’s own data infrastructure journey actually followed the inverse of this trend, beginning with a data lake and adding a warehouse and a lakehouse as the company moved from coding to testing to public-facing driverless cars on public roads.

“At one point, in our life stage, it made perfect sense for us to just rely on a data lake because our main customers were our ML [machine learning] engineers,” Elshenawy said. “Then you move into another architecture, data warehouses. If you have a lake and a warehouse, you’re moving data around from one place to another. And once you get to that point, and you have like a two-tier data architecture, where you’re replicating your data, know for sure that you probably want to move into the new architecture of a lakehouse, where you still have one data lake, but you get the benefits of building a data warehouse on top of that, so you end up serving both customers really well.”

He advised organizations in other industries to approach the problem with a similarly flexible mentality: begin with only the data infrastructure you need, then change it as the organization grows or as different teams come to need different things from the data.

“You have ML engineers expecting streaming directly from a data lake, versus business intelligence analysts, they want a data warehouse,” he said.
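Those two access patterns can be served from the same files. Continuing the earlier sketch, and with the same caveat that the tools and names are illustrative assumptions, an ML engineer can stream record batches straight off the lake while an analyst runs warehouse-style SQL over it:

```python
# Two consumers, one copy of the data (reuses lake/events.parquet from above).
import duckdb
import pyarrow.dataset as ds

# ML engineer's pattern: stream record batches off the lake files,
# e.g. to feed a feature-extraction or training pipeline.
for batch in ds.dataset("lake/events.parquet").to_batches(batch_size=1024):
    pass  # hand each Arrow batch to the training job

# BI analyst's pattern: warehouse-style SQL over the same files.
print(duckdb.sql("""
    SELECT CAST(ts AS DATE) AS day, COUNT(*) AS events
    FROM 'lake/events.parquet'
    GROUP BY day
    ORDER BY day
""").df())
```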

Making sure your AI models don’t overfit or underfit the real-world use cases

Though Cruise is not primarily in the business of developing or using large language models (LLMs) such as Anthropic’s Claude 2 or OpenAI’s ChatGPT, Elshenawy did say there is one major challenge that LLM users and Cruise’s autonomous vehicle AI models share: making sure the models don’t overfit or underfit — that is, that they are trained appropriately to respond to new, real-world data that does not necessarily resemble their training data, including edge cases.

Underfitting is when an AI model has not learned the patterns in its training data well enough, so it cannot reliably produce the desired responses even on real-world data that closely resembles that training data, whatever the sector or industry.

Overfitting is the opposite: the model has effectively memorized its training data, noise and all, and is flummoxed by new real-world data that does not match it, such as an edge case, an unusual event that does not happen frequently. The goal, for Cruise and for LLM users alike, is a model that is neither underfitted nor overfitted for its specific use case.
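The tradeoff is easy to see on a toy problem. The sketch below, generic scikit-learn rather than anything from Cruise, fits the same noisy curve with too little, reasonable, and too much model capacity; the gap between training and validation error is the telltale sign of which regime you are in.

```python
# Toy demonstration of underfitting vs. overfitting (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=120)  # signal plus noise
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    va = mean_squared_error(y_va, model.predict(X_va))
    print(f"degree={degree:2d}  train MSE={tr:.3f}  val MSE={va:.3f}")

# Degree 1: both errors high (underfit). Degree 15: tiny train error but a
# much larger validation error (overfit). The goal is the middle regime.
```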

Elshenawy said Cruise accomplishes this through the use of several different data science and machine learning techniques, including data augmentation and synthetic data generation.

Drilling down specifically on augmentation, Elshenawy offered the example of Cruise cars gathering data while making driverless trips on San Francisco’s public roads.

“Because we’re starting with San Francisco … we see a lot of great odd things that happen” while driving around, Elshenawy explained. “You can take one of those examples and create thousands of variations [in software] … change lighting conditions, the angles, speed velocities of all the other vehicles and so on. So you create almost a new dataset augmented out of something that you saw.”
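In code terms, that “one odd scene, thousands of variants” recipe might look something like the following sketch. The scene parameters and ranges are invented for illustration; a production AV stack would perturb full sensor logs in simulation rather than a handful of scalar values.

```python
# Hedged sketch of scenario augmentation: jitter lighting, viewing angle and
# other-vehicle speed around one observed scene. Parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

base_scene = {"lighting_lux": 400.0, "camera_yaw_deg": 0.0, "lead_vehicle_mps": 8.0}

def augment(scene: dict, n: int) -> list:
    """Generate n perturbed copies of one observed scene."""
    variants = []
    for _ in range(n):
        variants.append({
            "lighting_lux": scene["lighting_lux"] * rng.uniform(0.2, 2.0),   # dusk..bright
            "camera_yaw_deg": scene["camera_yaw_deg"] + rng.normal(0, 5.0),  # viewpoint jitter
            "lead_vehicle_mps": max(0.0, scene["lead_vehicle_mps"] + rng.normal(0, 3.0)),
        })
    return variants

dataset = augment(base_scene, 1000)  # "almost a new dataset" from one observation
print(len(dataset), dataset[0])
```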

One odd thing that has been happening more frequently of late: protesters placing traffic cones on driverless vehicles from Cruise and its Alphabet-backed rival Waymo, both of which are testing in San Francisco, covering the cars’ sensors and stopping them in their tracks.

Elshenawy said that even though these protests are a kind of “edge case,” the Cruise AI models have been built resiliently enough to act safely when such incidents occur.

“That is an example where actually our vehicles handle the situations very well, because we’ve built a generalized model, and the safe thing if you cover a sensor or damage a sensor is for the vehicle to pull over and wait for someone to come in and clear that hazard.”

AI + LLM = AGI?

When asked about the prospect of combining autonomous driving systems with LLMs to produce artificial general intelligence (AGI), Elshenawy was skeptical.

“I don’t think putting them together will directly lead to artificial general intelligence. Both are great in their own methods. Putting them together can have great advancements in human-robot interactions, but it’s not generally going to lead to that … what I’m excited about is how quickly both of them advance.”

Elshenawy also provided insight into Cruise’s rigorous approach to cybersecurity, essential for a safety-critical autonomous system.

“You truly need a multidisciplinary team, a team that spans across software engineers, data engineers, analysts, data scientists, security engineers,” he said.

The session offered a fascinating insider perspective on the data challenges overcome by one of the leaders in autonomous vehicles. As AI permeates more aspects of business and society, Cruise’s lessons on robust data infrastructure will only grow more relevant.
