The innovative approach to cooling behind NVIDIA’s $5M COOLERCHIPS grant

Grace Hopper Superchip CPU for generative AI. (SOPA Images/LightRocket via Getty Images)

Cooling a data center was already a challenge before the current AI-driven boom in accelerated computing took off. Servers are running hot, and by 2025 processor thermal ratings will reach 500 watts. Add GPUs, some of which already approach 700W, and the problems of power consumption and heat dissipation compound quickly. Traditional cooling technologies limit an IT organization’s ability to deliver solutions, and that limitation affects the business.

This impact goes beyond simply limiting compute density within a rack; it can fundamentally affect the bottom line. Analysts estimate that data centers worldwide account for 1.5% to 2% of global energy consumption. That’s a significant carbon footprint hanging over companies, almost all of which have sustainability goals. Beyond sustainability, there are costs: up to 40% of a data center’s energy consumption goes directly to cooling. That’s a big bill.
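To make that 40% figure concrete, here is a back-of-envelope sketch of what cooling alone can cost a facility. The facility size and electricity price are my own illustrative assumptions, not figures from the article:

```python
# Back-of-envelope estimate of a data center's annual cooling bill,
# using the article's figure that up to 40% of facility energy goes
# to cooling. Facility size and power price are assumptions.

IT_LOAD_MW = 10          # assumed IT load of a mid-size facility (MW)
COOLING_SHARE = 0.40     # cooling's share of total facility energy
PRICE_PER_KWH = 0.10     # assumed electricity price (USD/kWh)
HOURS_PER_YEAR = 8760

# If cooling is 40% of total energy, the IT load is the remaining 60%
# (simplifying by ignoring other overhead such as power conversion).
total_mw = IT_LOAD_MW / (1 - COOLING_SHARE)
cooling_mw = total_mw * COOLING_SHARE
annual_cooling_cost = cooling_mw * 1000 * HOURS_PER_YEAR * PRICE_PER_KWH

print(f"Total facility load: {total_mw:.2f} MW")
print(f"Cooling load:        {cooling_mw:.2f} MW")
print(f"Annual cooling cost: ${annual_cooling_cost:,.0f}")
```

Even under these modest assumptions, a 10 MW IT load implies roughly $5.8 million a year spent just on cooling, which is why a more efficient approach moves the needle.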

The US Department of Energy announced its COOLERCHIPS program late last year through its Advanced Research Projects Agency-Energy (ARPA-E) to address the problem of data center cooling. Last month, the agency awarded grants totaling $40 million to 15 organizations.

Each grantee takes a novel approach to solving the data center cooling problem. Grant amounts ranged from $1.2 million to $5 million. NVIDIA received the largest of these grants, $5 million, to pursue a unique combination of concepts that address cooling within a computer’s chassis.

NVIDIA, the data center infrastructure company

It’s no surprise that NVIDIA is interested in data center cooling. Its CEO, Jensen Huang, often talks about NVIDIA becoming a data center company and posits that the data center is the new unit of computation. This is not just visionary language; NVIDIA has been working hard to assemble (and acquire) almost all of the technology elements needed to make this vision a reality.

NVIDIA’s efforts are paying off. Recent earnings showed that its data center business accounts for over 68% of its total revenue, bringing in $4.3 billion in the first quarter of 2023. The current boom in AI-related infrastructure, coupled with the launch of several relevant new products by NVIDIA, means that the company is forecasting non-linear growth in the short term.

NVIDIA’s traditional approach to delivering accelerated computing is based on an add-in card model, where NVIDIA GPUs and accelerated networking products are sold independently and integrated into servers built by others. NVIDIA extended this model with the introduction of its DGX system in 2018. DGX is a turnkey AI solution for accelerated computing. However, DGX was just the beginning, as the company continues to ramp up its system-level efforts.

Indeed, last quarter NVIDIA announced other platform-level turnkey solutions, including the new DGX Cloud for hyperscalers and the upcoming OmniVerse Cloud. Cooling these systems will remain an ongoing challenge even as direct liquid cooling solutions become more popular among server vendors. Solving large AI challenges requires many powerful processors packed as tightly as possible to achieve maximum density. It’s a problem that needs a solution.

NVIDIA’s hybrid cooling approach

NVIDIA has been unusually reticent about its cooling efforts, declining several requests to answer questions about the grant or to talk about high-performance accelerator cooling in general. Still, there’s enough public information to understand the direction NVIDIA is taking – and it’s an intriguing approach that builds on several existing cooling technologies.

The NVIDIA COOLERCHIPS application describes a system that combines two proven approaches: direct liquid cooling (DLC) and immersion cooling. DLC is a well-trodden field with numerous solutions on the market. However, the effectiveness of the DLC approach is limited as the power density increases.

Immersion cooling, where the electronics are submerged in a dielectric or other liquid, can effectively enable high-density computing power. However, current approaches to immersion cooling typically require submerging servers in large liquid-filled tanks. While this approach works in many scenarios, such as edge installations, it can be cumbersome to implement in a traditional rack-oriented data center.

NVIDIA describes its COOLERCHIPS approach as a blend of these technologies. NVIDIA will use conventional DLC attached to the CPUs and accelerators in the system while also flooding the chassis with liquid, effectively turning the server into its own plunge pool. This allows the temperature zones within the system to be managed independently, all in a single solution housed in a traditional rack.
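The sizing logic behind the DLC half of such a design follows the standard heat-balance relation Q = ṁ·c_p·ΔT. The sketch below estimates the coolant flow a hypothetical node would need; the component count, wattages (drawn from the 500W CPU and 700W GPU trends cited earlier), and allowed temperature rise are assumptions, not details of NVIDIA’s actual design:

```python
# Illustrative direct-liquid-cooling loop sizing via Q = m_dot * c_p * dT.
# All node-level numbers here are hypothetical.

CP_WATER = 4186.0        # specific heat of water, J/(kg*K)
DELTA_T = 10.0           # assumed coolant temperature rise across cold plates (K)

def required_flow_lpm(heat_w: float) -> float:
    """Coolant flow in liters/minute needed to absorb heat_w watts."""
    kg_per_s = heat_w / (CP_WATER * DELTA_T)   # mass flow, kg/s
    return kg_per_s * 60.0                     # ~1 kg of water per liter

# Hypothetical node: two 500 W CPUs plus eight 700 W GPUs.
node_heat = 2 * 500 + 8 * 700   # 6600 W
print(f"Node heat load: {node_heat} W")
print(f"Required flow:  {required_flow_lpm(node_heat):.1f} L/min")
```

Roughly 9.5 L/min per node is manageable for a cold-plate loop, but the point of the hybrid design is that the immersion bath can pick up the remaining heat from components the cold plates don’t touch.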

NVIDIA’s COOLERCHIPS cooling approach. (Image: NVIDIA)

NVIDIA is not pursuing this project alone, but is drawing on the expertise of seven technology and research partners. NVIDIA’s in-house team of around a dozen engineers works with BOYD Corporation for cold plate technology, Durbin Group for the pumping system, Honeywell for fluid selection, and Vertiv Corporation for heat dissipation technology. The company also works with Binghamton and Villanova Universities on analysis, testing, and simulation, and with Sandia National Laboratories for reliability assessment.

NVIDIA said in a blog post that its COOLERCHIPS project will deliver three annual milestones. Unit tests will be completed in the first year, a partial rack will be evaluated in the second, and a fully tested system will be ready by the end of the third year.

Analyst Opinion

New approaches to cooling the accelerator-rich, AI-supported data center must be found. Current practices introduce real operational complexities and add costs that can dramatically impact a company’s bottom line. Therefore, solving the data center cooling challenges is paramount for the future of accelerated computing.

NVIDIA is far from alone in exploring innovative solutions to data center cooling challenges. The Open Compute Project (OCP) has long had a working group focused on cooling technology, and several exciting offshoots have sprung up. Each Tier 1 server OEM offers a variant of a rack-level liquid-cooled solution. And there are numerous players focusing on single and dual phase immersion cooling.

However, NVIDIA stands nearly alone among its direct competitors in pursuing new cooling solutions. While Intel Corporation is researching various approaches, including liquid immersion cooling, earlier this year the company shelved plans for the $700 million liquid cooling research facility it had intended to build in Oregon.

NVIDIA and the other COOLERCHIPS grantees know that current solutions to data center cooling challenges are limited and stopgap at best. NVIDIA’s COOLERCHIPS approach combines proven techniques, such as direct liquid cooling, with a novel take on immersion cooling.

If NVIDIA can deliver a solution that keeps pace with increased power densities without forcing IT architects to rethink infrastructure, the company will win. I’m excited to see what NVIDIA and its COOLERCHIPS partners deliver. And so are many data center architects.

Disclosure: Steve McDowell is an industry analyst and NAND Research is an industry analyst firm that provides or has provided research, analysis and consulting services to many technology companies, which may include those discussed in this article. Mr. McDowell has no stock positions in any of the companies mentioned in this article.


Steve McDowell is principal analyst and founding partner at NAND Research. Steve is a technologist with over 25 years of extensive industry experience in a variety of strategy, engineering and strategic marketing roles, all with a shared goal of bringing innovative technologies to the enterprise infrastructure market.
