Starburst, a provider of enterprise platform offerings that optimize the Trino distributed SQL query engine, recently marked the 10-year anniversary of the open-source code family from which the engine stems. Trino is a highly parallel, open-source distributed SQL query engine designed to perform interactive analytics on large volumes of data. VentureBeat spoke with co-creator Dain Sundstrom about the project’s growth and its future.
Open-source project lineage
Ten years ago, Sundstrom and co-creators Martin Traverso, David Phillips and Eric Hwang started the original Presto/Trino open-source code family at Facebook to solve the problem of fast analytics and querying over Facebook’s large datasets. In 2018, the creators parted with Facebook and the code family split into two lineages: the one remaining under Facebook was called PrestoDB, and the one the creators focused on was distinguished by the name PrestoSQL. In December 2020, the PrestoSQL lineage was rebranded as Trino, the name under which it continues to be developed today.
The engine was originally created to query massive datasets at speed, and it has grown and been refined greatly since its early days. Features such as security, which hardly existed in the first few releases, are now core to the project. The ecosystem of supported tools and integrations has expanded, as has the number of data connectors. These include connectors to relational data sources such as PostgreSQL, Oracle and SQL Server, as well as non-traditional sources such as Elasticsearch, OpenSearch, MongoDB and Apache Kafka. Sundstrom said refinements currently in the works include redesigning the function language for improved extensibility, improving support for ETL workloads and making this functionality work better out of the box to improve productivity for non-experts.
Sundstrom says the creators decided to open-source the project because of their shared open-source background. The challenges they faced and overcame included growing and scaling not just the software, which is a difficult enough problem in itself, but also the community: opening up communication among community members so that they collaborate on solving a common problem rather than develop separate solutions to it in parallel.
Trino use cases
Many companies, including Netflix and LinkedIn, use Trino for internal analytics, and some, such as Bloomberg and Comcast, also contribute to the open-source project. Sundstrom said Trino is especially popular with real-time, internet-based dispatch services such as ride hailing and food delivery, including Lyft and DoorDash, because it can run low-latency queries over large datasets. He added that it also performs extremely well on geospatial data, which is becoming ever more common and can be difficult to analyze.
Future view of Trino
Looking ahead, Sundstrom said he is excited about Trino’s future as the pace of innovation continues to accelerate and the engine's use cases expand to cover more workloads and data types. He anticipates growth in the range of problems Trino can address. For example, adding the capability to process geospatial data means that mapping companies, cellular providers and food delivery companies can derive added value from analyzing customer data.
The Trino community has already shown itself very capable of finding innovative solutions to its users’ problems. It’s hard to fathom that the Presto/Trino platforms are now 10 years old, but it’s easy to imagine Trino will become applicable to more use cases and user requirements over time.