Google’s PaLM 2 uses almost five times more text data than its predecessor

Google’s large language model PaLM 2 uses almost five times as much text data for training as its predecessor, CNBC has learned. When announcing PaLM 2 last week, Google said the model is smaller than the earlier PaLM but uses a more efficient technique. The lack of transparency around training data in artificial intelligence models is becoming a hot topic among researchers.

Alphabet Inc. CEO Sundar Pichai during the Google I/O Developers Conference in Mountain View, Calif., on Wednesday, May 10, 2023.

David Paul Morris | Bloomberg | Getty Images

Google’s new large language model, which the company announced last week, uses almost five times as much training data as its 2022 predecessor, allowing it to perform more complex programming, math and creative writing tasks, CNBC has learned.

PaLM 2, the company’s new general-purpose large language model (LLM) unveiled at Google I/O, is trained on 3.6 trillion tokens, according to internal documentation viewed by CNBC. Tokens, the chunks of text — whole words or pieces of words — that training data is broken into, are an important building block for training LLMs because they teach the model to predict the next word that will appear in a sequence.
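To make “tokens” concrete, here is a minimal sketch using OpenAI’s open-source tiktoken tokenizer as a stand-in; Google has not released PaLM 2’s tokenizer, so the exact splits below are illustrative only.

```python
# Illustrative only: tiktoken is OpenAI's open-source tokenizer, used here
# as a stand-in because PaLM 2's own tokenizer is not public.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "PaLM 2 is trained on 3.6 trillion tokens."

token_ids = enc.encode(text)                   # text -> list of integer IDs
print(len(token_ids), "tokens")                # one sentence -> a dozen-odd tokens
print([enc.decode([t]) for t in token_ids])    # the text chunk behind each ID
```

During training, the model repeatedly sees sequences of these IDs and learns to assign high probability to the ID that actually comes next.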

Google’s previous version of PaLM, which stands for Pathways Language Model, was released in 2022 and trained on 780 billion tokens.

While Google has been eager to demonstrate the power of its artificial intelligence technology and how it can be embedded in search, email, word processing, and spreadsheets, the company has been reluctant to disclose the size or other details of its training data. OpenAI, the Microsoft-backed developer of ChatGPT, has also kept the details of its latest LLM, called GPT-4, secret.

According to the companies, the reason for the lack of disclosure is the competitive nature of the business. Google and OpenAI are rushing to lure users who may want to search for information using conversational chatbots instead of traditional search engines.

But as the AI arms race intensifies, the research community is demanding more transparency.

Since unveiling PaLM 2, Google has said the new model is smaller than prior LLMs, which is significant because it means the company’s technology becomes more efficient while handling more demanding tasks. According to internal documents, PaLM 2 has 340 billion parameters, an indication of the model’s complexity. The initial PaLM had 540 billion parameters.

Google did not immediately comment on this story.

Google said in a PaLM 2 blog post that the model uses a “new technique” called “compute-optimal scaling.” This makes the LLM “more efficient with overall better performance, including faster inference, fewer parameters to serve, and a lower serving cost.”
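What compute-optimal scaling means in practice shows up in the figures reported above: PaLM 2 trades parameters for training tokens. A quick back-of-the-envelope calculation, using only the numbers in this story, makes the shift explicit:

```python
# Token-to-parameter ratios implied by the figures reported in this story.
models = {
    "PaLM (2022)": {"params": 540e9, "tokens": 780e9},
    "PaLM 2":      {"params": 340e9, "tokens": 3.6e12},
}

for name, m in models.items():
    ratio = m["tokens"] / m["params"]
    print(f"{name}: {ratio:.1f} tokens trained per parameter")

# PaLM (2022): 1.4 tokens trained per parameter
# PaLM 2:      10.6 tokens trained per parameter
```

The direction is consistent with the “compute-optimal” line of research popularized by DeepMind’s Chinchilla paper, which found that models perform better when token counts grow roughly in step with parameter counts rather than parameters alone being scaled up.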

In announcing PaLM 2, Google confirmed CNBC’s previous reporting that the model is trained on 100 languages and performs a wide range of tasks. It’s already being used across 25 features and products, including the company’s experimental chatbot, Bard. It comes in four sizes, from smallest to largest: Gecko, Otter, Bison, and Unicorn.

According to public statements, PaLM 2 is more powerful than any existing model. Facebook’s LLM called LLaMA, announced in February, was trained on 1.4 trillion tokens. The last time OpenAI disclosed a model’s training size was with GPT-3, which the company said was trained on 300 billion tokens. OpenAI released GPT-4 in March and said it showed “human-level performance” on many professional tests.

LaMDA, a conversational LLM that Google launched two years ago and touted in February alongside Bard, was trained on 1.5 trillion tokens, according to the latest documents viewed by CNBC.

As new AI applications quickly reach the mainstream, controversies surrounding the underlying technology only heat up.

El Mahdi El Mhamdi, a senior Google research scientist, resigned in February over the company’s lack of transparency. On Tuesday, OpenAI CEO Sam Altman testified at a hearing of the Senate Judiciary Subcommittee on Privacy, Technology, and the Law, agreeing with lawmakers that a new system to deal with AI is needed.

“For a very new technology, we need a new framework,” Altman said. “Certainly, companies like ours have a great responsibility for the tools we make available to the world.”

– CNBC’s Jordan Novet contributed to this report.

WATCH: OpenAI CEO Sam Altman calls for AI oversight