Large Language Models Use Triton for AI Inference

Julien Salinas wears many hats. He is an entrepreneur, software developer and, until recently, a volunteer firefighter in his mountain village an hour’s drive from Grenoble, a technology hub in south-eastern France.

He runs NLP Cloud, a two-year-old startup that is already profitable, employs about a dozen people, and serves customers around the world. It’s one of many companies worldwide using NVIDIA software to deliver some of today’s most complex and powerful AI models.

NLP Cloud is an AI-powered software service for text data. A large European airline uses it to summarize Internet news for its employees. A small healthcare company uses it to analyze patient requests for prescription refills. An online app uses it to let kids talk to their favorite cartoon characters.

Great language models speak volumes

It’s all part of the magic of natural language processing (NLP), a popular form of AI that produces some of the largest neural networks in the world, known as large language models. Trained on powerful systems with huge data sets, LLMs can do all sorts of tasks like recognizing and generating text with amazing accuracy.

NLP Cloud uses about 25 LLMs today; the largest has 20 billion parameters, a key measure of a model’s size and capability. And now it implements BLOOM, an LLM with a whopping 176 billion parameters.

Running these massive models efficiently in production across multiple cloud services is hard work. For this reason, Salinas turns to NVIDIA Triton Inference Server.

High throughput, low latency

“Very quickly, the biggest challenge for us was server costs,” said Salinas, proud that his self-funded startup hasn’t received any outside support to date.

“Triton has proven to be a great way to take full advantage of the GPUs that we have,” he said.

For example, NVIDIA A100 Tensor Core GPUs can handle up to 10 requests simultaneously — twice the throughput of alternative software — thanks to FasterTransformer, a Triton backend that automates complex tasks like splitting models across many GPUs.
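Serving many requests at once is what Triton’s dynamic batching is for: incoming requests are queued briefly and merged into a single batch before each forward pass, so one GPU stays fully utilized. The sketch below is a toy illustration of that batching idea, assuming a hypothetical queue of requests and a batch limit of 10 — it is not the Triton API.

```python
# Toy sketch of dynamic batching, the idea behind serving up to 10
# concurrent requests on one GPU. All names here are hypothetical;
# real batching is configured in Triton's model configuration.
from dataclasses import dataclass

@dataclass
class Request:
    id: int
    prompt: str

def form_batches(queue, max_batch_size=10):
    """Merge queued requests into batches of at most max_batch_size,
    mimicking how a dynamic batcher groups concurrent requests so each
    GPU forward pass processes many requests together."""
    return [queue[i:i + max_batch_size]
            for i in range(0, len(queue), max_batch_size)]

# 23 concurrent requests, batched 10 at a time -> 3 forward passes
queue = [Request(i, f"prompt {i}") for i in range(23)]
batches = form_batches(queue, max_batch_size=10)
print(len(batches))                # 3
print([len(b) for b in batches])   # [10, 10, 3]
```

The point of the sketch: without batching, 23 requests would mean 23 forward passes; with a batch limit of 10, the same work takes 3.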

FasterTransformer also helps NLP Cloud distribute jobs that require more memory across multiple NVIDIA T4 GPUs while reducing the response time for the task.
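For readers curious what splitting a model across GPUs looks like in practice, here is a hypothetical sketch of a Triton model configuration (config.pbtxt) for the FasterTransformer backend. The model name and values are illustrative assumptions, not NLP Cloud’s actual configuration.

```
# Hypothetical config.pbtxt sketch for a model served through the
# FasterTransformer backend. Field names follow Triton's model
# configuration format; all values are illustrative.
name: "gpt_20b"
backend: "fastertransformer"
max_batch_size: 10
parameters {
  key: "tensor_para_size"
  value: { string_value: "2" }  # split the model's weights across 2 GPUs
}
```

Tensor parallelism of this kind is how a model too large for one GPU’s memory can be distributed across several smaller GPUs, such as the T4s mentioned above.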

Customers who demand the fastest response times can process 50 tokens – text elements such as words or punctuation marks – in just half a second with Triton on an A100 GPU, about a third of the response time without Triton.
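A quick back-of-the-envelope check of those figures (a sketch of the arithmetic in the paragraph above, not a benchmark):

```python
# Arithmetic check of the latency figures quoted above.
tokens = 50
latency_with_triton = 0.5                 # seconds, from the article
throughput = tokens / latency_with_triton # tokens per second
# "about a third of the response time without Triton" implies ~3x latency
latency_without = latency_with_triton * 3

print(throughput)       # 100.0 tokens/second
print(latency_without)  # 1.5 seconds
```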

“That’s very cool,” said Salinas, who has reviewed dozens of software tools on his personal blog.

A tour of Triton users

Around the world, other startups and established giants use Triton to get the most out of LLMs.

Microsoft’s translation service helped disaster workers understand Haitian Creole while responding to a 7.0 magnitude earthquake. It was one of many use cases for the service, which achieved 27x speedup using Triton to run inference on models with up to 5 billion parameters.

NLP provider Cohere was founded by one of the AI researchers who wrote the landmark paper that defined transformer models. Using Triton on its custom LLMs speeds up inference by up to 4x, allowing users of customer support chatbots, for example, to get quick answers to their questions.

NLP Cloud and Cohere are among the many members of the NVIDIA Inception program that nurtures cutting-edge startups. Several other Inception startups also use Triton for AI inference on LLMs.

Tokyo-based company Rinna has created chatbots used by millions in Japan, as well as tools that allow developers to create custom chatbots and AI-powered characters. Triton helped the company achieve less than two seconds of inference latency on GPUs.

Based in Tel Aviv, Tabnine runs a service that automates up to 30% of the code written by a million developers worldwide (see demo below). Its service runs multiple LLMs on A100 GPUs with Triton to handle more than 20 programming languages and 15 code editors.

Twitter uses the LLM service of Writer, based in San Francisco. It ensures that social network employees write in a voice that conforms to the company’s style guide. Writer’s service achieves 3x lower latency and up to 4x higher throughput with Triton compared to previous software.

If you want to put a face to those words, Inception member Ex-human, just around the corner from Writer, helps users create realistic avatars for games, chatbots, and virtual reality applications. With Triton, it delivers sub-second response times on a 6 billion parameter LLM while reducing GPU memory consumption by a third.

A full-stack platform

Back in France, NLP Cloud now leverages other elements of the NVIDIA AI platform.

For models that run inference on a single GPU, the company uses NVIDIA TensorRT software to minimize latency. “We get lightning-fast performance with it, and the latency really goes down,” said Salinas.

The company also began training custom versions of LLMs to support more languages and increase efficiency. For this work, it uses NVIDIA NeMo Megatron, an end-to-end framework for training and serving LLMs with trillions of parameters.

Salinas, 35, has the energy of a 20-year-old to code and grow his business. He describes plans to build private infrastructure to complement the four public cloud services the startup uses, as well as expand to LLMs that process speech and text-to-image to address applications like semantic search.

“I’ve always loved coding, but being a good developer isn’t enough: you have to understand the needs of your customers,” said Salinas, who has posted code on GitHub nearly 200 times in the past year.

If you’re passionate about software, get the latest on Triton in this technical blog.