Analysts estimate that by 2025, 30% of generated data will be real-time data. That is 52 zettabytes (ZB) of real-time data per year – roughly the amount of total data produced in 2020. Since data volumes have grown so rapidly, 52 ZB is three times the amount of total data produced in 2015. With this exponential growth, it’s clear that conquering real-time data is the future of data science.
Over the last decade, the likes of Materialize, Deephaven, Kafka and Redpanda have developed technologies to work with these streams of real-time data. They can transform, transmit and persist data streams on the fly and provide the basic building blocks needed to construct applications for the new real-time reality. But to really make such enormous volumes of data useful, artificial intelligence (AI) must be employed.
Enterprises need insightful technology that can create knowledge and understanding with minimal human intervention to keep up with the tidal wave of real-time data. Applying AI algorithms to real-time data is still in its infancy, though. Specialized hedge funds and big-name AI players – like Google and Facebook – make use of real-time AI, but few others have waded into these waters.
To make real-time AI ubiquitous, supporting software must be developed. This software needs to provide:
- An easy path to transition from static to dynamic data
- An easy path for cleaning static and dynamic data
- An easy path for going from model creation and validation to production
- An easy path for managing the software as requirements – and the outside world – change
An easy path to transition from static to dynamic data
Developers and data scientists want to spend their time thinking about important AI problems, not worrying about time-consuming data plumbing. A data scientist should not care if data is a static table from Pandas or a dynamic table from Kafka. Both are tables and should be treated the same way. Unfortunately, most current generation systems treat static and dynamic data differently. The data is obtained in different ways, queried in different ways, and used in different ways. This makes transitions from research to production expensive and labor-intensive.
To really get value out of real-time AI, developers and data scientists need to be able to seamlessly transition between using static data and dynamic data within the same software environment. This requires common APIs and a framework that can process both static and real-time data in a UX-consistent way.
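As a rough illustration of what "the same API for static and dynamic data" can mean, the sketch below writes one function against a plain iterable of rows and feeds it both a finished list and a generator that stands in for a live feed. The names `running_mean`, `static_rows` and `live_feed` are hypothetical and do not come from any product mentioned here; a real system would put a table abstraction in place of the iterable.

```python
from typing import Dict, Iterable, Iterator

Row = Dict[str, float]


def running_mean(rows: Iterable[Row], column: str) -> Iterator[float]:
    """Yield the mean of `column` over all rows seen so far.

    The function only iterates, so it does not care whether `rows`
    is a finished list (static) or a feed that is still producing
    rows (dynamic).
    """
    total = 0.0
    count = 0
    for row in rows:
        total += row[column]
        count += 1
        yield total / count


# Static case: the whole dataset already exists in memory.
static_rows = [{"price": 10.0}, {"price": 20.0}, {"price": 30.0}]


def live_feed() -> Iterator[Row]:
    """Hypothetical stand-in for a stream; rows arrive one at a time."""
    for price in (10.0, 20.0, 30.0):
        yield {"price": price}


# The same call works for both sources.
static_result = list(running_mean(static_rows, "price"))
streaming_result = list(running_mean(live_feed(), "price"))
```

Because the research code and the production code call the same function, moving from one to the other means changing only where the rows come from.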
An easy path for cleaning static and dynamic data
The sexiest work for AI engineers and data scientists is creating new models. Unfortunately, the bulk of an AI engineer’s or data scientist’s time is devoted to being a data janitor. Datasets are inevitably dirty and must be cleaned and massaged into the right form. This is thankless and time-consuming work. With an exponentially growing flood of real-time data, this whole process must take less human labor and must work on both static and streaming data.
In practice, easy data cleaning is accomplished by having a concise, powerful, and expressive way to perform common data cleaning operations that works on both static and dynamic data. This includes removing bad data, filling missing values, joining multiple data sources, and transforming data formats.
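A minimal sketch of that idea, assuming rows arrive as dictionaries: the cleaning rules live in one function that removes bad prices, fills missing volumes, converts formats and joins a lookup table. The field names (`symbol`, `price`, `volume`, `sector`) and the function `clean` are invented for illustration.

```python
import math
from typing import Any, Dict, Iterable, Iterator


def clean(
    rows: Iterable[Dict[str, Any]],
    sectors: Dict[str, str],
    default_volume: int = 0,
) -> Iterator[Dict[str, Any]]:
    """Apply the same cleaning rules to any source of rows."""
    for row in rows:
        # Transform formats: prices may arrive as strings or numbers.
        try:
            price = float(row.get("price"))
        except (TypeError, ValueError):
            continue  # remove bad data: missing or unparseable price
        if not (math.isfinite(price) and price > 0):
            continue  # remove bad data: NaN, infinite or non-positive price

        volume = row.get("volume")
        yield {
            "symbol": row["symbol"],
            "price": price,
            # Fill missing values with a default.
            "volume": default_volume if volume is None else volume,
            # Join a second data source (a symbol -> sector lookup).
            "sector": sectors.get(row["symbol"], "unknown"),
        }


raw_rows = [
    {"symbol": "AAA", "price": "101.5", "volume": 200},
    {"symbol": "BBB", "price": None, "volume": 50},  # dropped: no price
    {"symbol": "CCC", "price": -1, "volume": 10},    # dropped: bad price
    {"symbol": "DDD", "price": 7, "volume": None},   # volume gets filled
]
sectors = {"AAA": "tech"}

# Static use: clean a finished list.
cleaned = list(clean(raw_rows, sectors))
# Streaming use: the same function consumes an iterator row by row.
cleaned_stream = list(clean(iter(raw_rows), sectors))
```

The point is not the specific rules but that they are written once, so research and production cannot drift apart in how they treat bad data.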
Currently, there are a few technologies that allow users to implement data cleaning and manipulation logic just once and use it for both static and real-time data. Materialize and ksqlDB both allow SQL queries of Kafka streams. These options are good choices for use cases with relatively simple logic or for SQL developers. Deephaven has a table-oriented query language that supports Kafka, Parquet, CSV, and other common data formats. This kind of query language is suited for more complex and more mathematical logic, or for Python developers.
An easy path for going from model creation and validation to production
Many – possibly even most – new AI models never make it from research to production. This holdup occurs because research and production are typically implemented using very different software environments. Research environments are geared towards working with large static datasets, model calibration, and model validation. Production environments, on the other hand, make predictions on new events as they come in. To increase the fraction of AI models that impact the world, the steps for moving from research to production must be extremely easy.
Consider an ideal scenario: First, static and real-time data would be accessed and manipulated through the same API. This provides a consistent platform to build applications using static and/or real-time data. Second, data cleaning and manipulation logic would be implemented once for use in both static research and dynamic production cases. Duplicating this logic is expensive and increases the odds that research and production differ in unexpected and consequential ways. Third, AI models would be easy to serialize and deserialize. This allows production models to be switched out simply by changing a file path or URL. Finally, the system would make it easy to monitor – in real time – how well production AI models are performing in the wild.
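The third and fourth points can be sketched in a few lines. The example below is an assumption-laden toy, not any vendor's method: `LinearModel` stands in for a trained model, JSON files stand in for a model store, and `RollingError` is a hypothetical monitor that tracks mean absolute error over the most recent predictions.

```python
import json
import tempfile
from collections import deque
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class LinearModel:
    """Toy stand-in for a trained model (hypothetical)."""
    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        return self.slope * x + self.intercept


def save_model(model: LinearModel, path: Path) -> None:
    """Serialize the model parameters to a JSON file."""
    path.write_text(json.dumps(asdict(model)))


def load_model(path: Path) -> LinearModel:
    """Rebuild a model from a JSON file written by save_model."""
    return LinearModel(**json.loads(path.read_text()))


class RollingError:
    """Mean absolute error over the last `window` predictions."""

    def __init__(self, window: int) -> None:
        self.errors = deque(maxlen=window)

    def update(self, predicted: float, actual: float) -> float:
        self.errors.append(abs(predicted - actual))
        return sum(self.errors) / len(self.errors)


with tempfile.TemporaryDirectory() as tmp:
    v1_path = Path(tmp) / "model_v1.json"
    v2_path = Path(tmp) / "model_v2.json"
    save_model(LinearModel(slope=2.0, intercept=1.0), v1_path)
    save_model(LinearModel(slope=3.0, intercept=0.0), v2_path)

    # Production only knows a path; swapping models is a path change.
    model_path = v1_path
    prediction_v1 = load_model(model_path).predict(3.0)
    model_path = v2_path
    prediction_v2 = load_model(model_path).predict(3.0)

# Monitor quality as true outcomes arrive.
monitor = RollingError(window=2)
error_after_first = monitor.update(prediction_v1, actual=8.0)
error_after_second = monitor.update(prediction_v2, actual=9.0)
```

In a real deployment the path would more likely come from configuration or a URL, and the monitor would feed a dashboard or an alert rather than a local variable.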
An easy path for managing the software as requirements – and the outside world – change
Change is inevitable, especially when working with dynamic data. In data systems, these changes can be in input data sources, requirements, team members and more. No matter how carefully a project is planned, it will be forced to adapt over time. Often these adaptations never happen. Accumulated technical debt and knowledge lost through staffing changes kill these efforts.
To handle a changing world, real-time AI infrastructure must make all phases of a project (from training to validation to production) understandable and modifiable by a very small team. That applies not just to the team that originally built it: new individuals who inherit existing production applications must be able to understand and modify them too.
As the tidal wave of real-time data strikes, we will see significant innovations in real-time AI. Real-time AI will move beyond the Googles and Facebooks of the world and into the toolkit of all AI engineers. We will get better answers, faster, and with less work. Engineers and data scientists will be able to spend more of their time focusing on interesting and important real-time solutions. Businesses will get higher-quality, timely answers from fewer employees, reducing the challenges of hiring AI talent.
When we have software tools that facilitate these four requirements, we will finally be able to get real-time AI right.
Chip Kent is the chief data scientist at Deephaven Data Labs.