Was data lakehouse platform Databricks becoming an OpenAI rival on anyone’s 2023 bingo card? Well, hello, Dolly.
Today, in an effort the company says is meant to build on its longtime mission to democratize AI for the enterprise, Databricks released the code for an open-source large language model (LLM) called Dolly — named after Dolly the sheep, the first cloned mammal — that it said companies can use to create instruction-following chatbots similar to ChatGPT.
The model can be trained, the company explained in a blog post, on very little data and in very little time. “With 30 bucks, one server and three hours, we’re able to teach [Dolly] to start doing human-level interactivity,” said Databricks CEO Ali Ghodsi.
There are many reasons a company would prefer to build its own LLM rather than send data to a centralized LLM provider that serves a proprietary model behind an API, the blog post explained. Handing sensitive data over to a third party may not be an option, and organizations may have specific needs around model quality, cost and desired behavior.
“We believe that most ML users are best served long term by directly owning their models,” said the blog post.
Databricks found ChatGPT-like qualities don’t require latest or largest LLM
According to the announcement, Dolly is meant to show that anyone “can take a dated off-the-shelf open source large language model and give it magical ChatGPT-like instruction.” Surprisingly, the company said, instruction-following does not seem to require the latest or largest models — Dolly has only 6 billion parameters, compared to 175 billion for GPT-3.
“We’ve been calling ourselves a data and AI company since 2013, and we have close to 1000 customers that have been using some kind of large language model on Databricks,” said Ghodsi, who told VentureBeat he was “blown away” when ChatGPT was launched at the end of November 2022, but realized only a few companies on the planet have the massive language models necessary for ChatGPT-level ability.
“Most people were thinking, do we have to all leverage these proprietary models that these very few companies have? And if so, do we have to give them our data?” he said.
The answer to both of those questions is no: In February, Meta released the weights for a set of high-quality (but not instruction-following) language models called LLaMA to academic researchers, trained for over 80,000 GPU-hours each. Then, in March, Stanford built the Alpaca model, which was based on LLaMA, but tuned on a small dataset of 50,000 human-like questions and answers that, surprisingly, made it exhibit ChatGPT-like interactivity.
Inspired by those two efforts, Databricks took an existing open-source 6-billion-parameter model from EleutherAI and slightly modified it, using data from Alpaca, to elicit instruction-following capabilities such as brainstorming and text generation that were not present in the original model.
Surprisingly, the modified model worked very well. According to the blog post, this suggests that “much of the qualitative gains in state-of-the-art models like ChatGPT may owe to focused corpuses of instruction-following training data, rather than larger or better-tuned base models.”
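To make the approach concrete, the sketch below shows roughly what one record of instruction-following training data looks like when formatted into a training prompt. The field names (instruction/input/output) and the template follow the publicly released Alpaca dataset; the exact format Databricks used for Dolly is an assumption here, not confirmed by the article.

```python
# Hypothetical sketch of Alpaca-style instruction data formatting.
# The instruction/input/output fields mirror the public Alpaca dataset;
# the prompt template is an assumption for illustration only.

def format_example(record: dict) -> str:
    """Turn one instruction-following record into a prompt/response string."""
    if record.get("input"):
        return (
            "Below is an instruction paired with an input.\n\n"
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            f"### Response:\n{record['output']}"
        )
    return (
        "Below is an instruction.\n\n"
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )

example = {
    "instruction": "Brainstorm three names for a pet sheep.",
    "input": "",
    "output": "Dolly, Clover, Woolly",
}
print(format_example(example))
```

A small corpus of records like this — tens of thousands rather than billions of tokens — is what the fine-tuning step consumes, which is why the process can be so cheap relative to pretraining.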
LLMs will not be in the hands of only a few companies
Ghodsi said that going forward there will be many more LLMs, which will become cheaper and cheaper — and won’t be in the hands of only a few companies.
“Every organization on the planet will probably utilize these,” he said. “Our belief is that in every industry, the winning, leading companies will be data and AI companies that will be leveraging this kind of technology and will have these kinds of models.”