In the HPC world, there is no resting on your laurels, no time to sit back and bask in a hard-won achievement that has taken years to build. The ticker tape from last year's long-awaited celebration of finally reaching exascale computing, with the Frontier supercomputer at Oak Ridge National Laboratory breaking that barrier, has only just been swept up.
With that in the rearview mirror, attention turns to the next challenge: zettascale computing, approximately 1,000 times faster than what Frontier delivers. In the heady months following his return to Intel as CEO in 2021, Pat Gelsinger made headlines when he said the giant chipmaker was looking to 2027 to achieve zettascale.
“So Zettascale in 2027 is a huge internal initiative that’s going to bring a lot of our technologies together,” Gelsinger said in October 2021 in response to a question from The Next Platform. “1,000X in five years? That’s pretty phenomenal.”
Lisa Su, the chief executive officer who led the remarkable turnaround at Intel's main competitor, AMD, took the stage at ISSCC 2023 to talk about zettascale computing and lay out a much more conservative, some would say more reasonable, timeline.
Looking at the performance trends of supercomputers over the past two decades and the ongoing innovations in computing (think advanced packaging technologies, CPUs and GPUs, chiplet architectures, and the pace of AI adoption, among others), Su calculated that the industry could reach zettascale within the next ten years or so.
“We achieved a very significant milestone just last year, the first exascale supercomputer,” she said during her presentation, noting that Frontier, built with HPE systems running on AMD chips, “uses a combination of CPUs and GPUs. There’s a lot of technology in there. We have been able to achieve exascale supercomputing, both from a performance standpoint and, more importantly, from an efficiency standpoint. Now let’s draw the line, assuming that [we can] keep up the pace of innovation. … That’s a challenge we all need to think through. How could we achieve that?”
The crux of the challenge will be energy efficiency. Even as datacenter server performance doubles every 2.4 years, HPC system performance every 1.2 years, and GPU performance every 2.2 years, gains in server efficiency are beginning to slow.
GPU efficiency gains are also easing off a bit:
Meanwhile, supercomputing efficiency is doubling every 2.2 years, but that still projects to a zettascale system around 2035 consuming 500 megawatts at 2,140 gigaflops per watt.
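Those figures are straightforward to check. Here is a back-of-the-envelope calculation (our arithmetic, not Su's model, starting from Frontier's roughly 52 gigaflops per watt on the 2022 Green500):

```python
import math

# Our sanity check of the projection, not Su's actual model.
ZETTAFLOP = 1e21                      # one zettaflop, in flops
projected_gflops_per_watt = 2_140     # Su's ~2035 efficiency figure

power_watts = ZETTAFLOP / (projected_gflops_per_watt * 1e9)
print(f"{power_watts / 1e6:.0f} MW")  # ~467 MW, i.e. roughly 500 megawatts

# The trend line itself: starting from Frontier's ~52 gigaflops per watt and
# doubling every 2.2 years, the year the curve reaches 2,140 GF/W:
years = 2.2 * math.log2(projected_gflops_per_watt / 52.0)
print(f"{2022 + years:.0f}")          # ~2034, consistent with "around 2035"
```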
“It’s not practical for us,” Su said. “It’s on the scale of a nuclear power plant, so it basically means our challenge is to figure out how we think about computational efficiency as a top priority over the next decade. A lot of work has been done across the industry, but this is the greatest challenge we must face in order to continue the dramatic increase in performance and capability that we have seen.”
That kind of efficiency is possible. Not only was Frontier the world’s fastest computer, faster than the next six systems on the Top500 list combined, it also ranked second on the Green500 list of the world’s most efficient supercomputers.
However, there are challenges that make continued efficiency gains difficult. One is the slowing of Moore’s Law, which makes it harder to improve density, performance, and efficiency all at once. Another is that I/O doesn’t scale like logic. There have been gains in energy per bit, but largely because I/O distances have been shrinking; in much larger systems such as supercomputers, I/O remains a limiting factor for efficiency. Here is an I/O power efficiency chart presented by Su:
Larger datasets, and the bandwidth needed between compute and memory to feed them, also drive up memory access power.
“What do we have to do in the next ten years?” she asked. “It’s really about driving system-level efficiency holistically, by considering all the elements across compute, communication and storage that allow us to achieve the most efficient systems.”
The area AMD is most focused on is advanced architecture, with the goal of using “the right compute technology for the right workload,” Su said. “If you think about this whole discussion about heterogeneous architectures or accelerated computing, that’s really what we’re trying to do.”
Frontier uses AMD’s Instinct MI250 accelerator, a 6-nanometer GPU with domain-specific architectural improvements for HPC and AI workloads and 2.5D chiplet integration that brings high-bandwidth memory closer to the compute.
Now that 3D chiplets are starting to take off in the industry, Su talked about the idea of stacking memory on top of the compute unit to reduce the amount of power the processor needs to access the memory.
“What it really does is allow us to bring the components of computing much closer together and lower the cost of communication,” the CEO said. “When you had these compute elements on a circuit board, think about how far apart they were and how much energy you had to expend on the communication between them. Now you can have them stacked in a 2D or 2.5D arrangement on the package, or in a 3D arrangement, and you just see a huge improvement in the overall communication efficiency.”
Domain-specific computing is another area she pointed to, essentially using the right tool for the right operations. Moving from double-precision floating point to other math formats means more efficient computing, especially when AI and machine learning can be leveraged in the process to drive more automation. It also means more specific acceleration for certain applications.
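As a rough illustration (our sketch, not AMD's methodology) of what a format change buys: dropping from FP64 to a compact format such as FP16 quarters the bytes that must be moved per value, at a precision cost many AI workloads tolerate:

```python
import numpy as np

n = 1024
a64 = np.random.rand(n, n)            # double-precision (FP64) operands
a16 = a64.astype(np.float16)          # the same values in compact FP16

# 4x fewer bytes to fetch, store, and shuffle per matrix:
print(a64.nbytes // a16.nbytes)       # -> 4

# The price is numerical precision, often acceptable for AI workloads:
err = np.abs(a64 - a16.astype(np.float64)).max()
print(f"max rounding error: {err:.1e}")   # on the order of 1e-4 for values in [0, 1)
```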
It all adds up to the next-gen GPU, the Instinct MI300, which we talked about extensively here last fall, for both HPC and AI workloads.
“With the 5-nanometer process technology, with 3D stacking, with our cache actually stacked, the fabric die at the bottom, the CPUs and GPUs stacked on top, with new math formats, with the different memory architecture, you can see improvements on the order of 5X to 8X, whether it’s about efficiency or performance,” Su said.
Stacking will be important with CPUs and GPUs, which typically have their own memory pools, meaning data has to be copied between them if you want to share it. The MI300’s CDNA 3 APU architecture includes a unified memory architecture that makes data access more energy efficient by eliminating the redundant memory copies required by the MI250 and its separate memory pools.
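As a toy illustration of the copies a unified pool eliminates (the working-set size and handoff count below are invented for the example, not MI300 internals):

```python
def bytes_copied(working_set: int, handoffs: int, unified: bool) -> int:
    """Bytes copied between pools when CPU and GPU exchange a working set."""
    return 0 if unified else working_set * handoffs

gib = 2**30
ws = 64 * gib                                               # hypothetical 64 GiB working set
print(bytes_copied(ws, handoffs=4, unified=False) / gib)    # 256.0 GiB copied around
print(bytes_copied(ws, handoffs=4, unified=True) / gib)     # 0.0 GiB, data shared in place
```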
Su touched on other innovations to come in areas like memory and compute stacking.
“What we’ve demonstrated so far is SRAM stacking on compute chips,” she said. “We put that into production, and it delivers a significant improvement on certain workloads, not all workloads. But there are more ways to stack: DRAM on compute, and other types of memory.”
AMD is also working with Samsung to bring processing into memory, which Su admitted, “as a processor person, seems a little counterintuitive.” However, there are some processing operations that can likely be pushed into memory. Research teams from AMD and Samsung found that placing a few algorithmic kernels in memory can reduce overall access power by up to 85 percent.
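To see how a figure like that can arise, here is a toy per-bit energy model; the costs below are illustrative assumptions of ours, not measurements from the AMD/Samsung work:

```python
# Illustrative per-bit energy costs, NOT measured values.
PJ_PER_BIT_OFF_CHIP = 8.0     # assumed: bit moved over the DRAM interface to the processor
PJ_PER_BIT_IN_STACK = 1.0     # assumed: bit processed in place inside the memory stack

bits_scanned = 8 * (8 * 2**30)            # an 8 GiB scan, expressed in bits
baseline = bits_scanned * PJ_PER_BIT_OFF_CHIP
in_memory = bits_scanned * PJ_PER_BIT_IN_STACK
print(f"access-energy reduction: {1 - in_memory / baseline:.0%}")  # 88% with these assumptions
```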
“The work here isn’t just focused on the individual components; it’s also how applications would use this technology,” she said. “This is an area where a lot of cross-functional learning needs to happen.”
In addition, the chipmaker is working with DARPA on co-packaged optical communication technologies to improve I/O efficiency. Chiplets and advanced packaging address local communication, but more needs to be done to make longer-range I/O efficient. One possibility would be closer integration between the optical receivers and the compute chip in an optical package.
Ultimately, the goal is to move towards a system-in-package architecture, with the package becoming the new motherboard, which includes everything from CPUs and accelerators to memory and optics, Su said.
“That requires us to think differently on many levels,” she said. “From a compute perspective, our goal would be to optimize each of these cores to be the best they can be. Whether it’s a CPU, or a GPU, or a domain-specific accelerator, or you’re using an ASIC to do machine learning, either training or inference, you can have each of these compute cores tuned, and they can be tuned by different people.”
Standardizing chip-to-chip interfaces becomes important so that components can be mixed and matched.
AI will also play an increasingly important role in all of this, beyond being a tool for familiar problems like training very large models. One area would be the creation of AI surrogate models for physics. For complex physics problems, the traditional approach is to run CFD models on huge datasets.
However, AI-accelerated HPC “is the idea that you can do some of the physics simulation with traditional HPC computing, and then actually train on that data, and then infer on that data to shorten cycles,” Su said. “Then maybe you haven’t gotten to the right answer yet, so you would run the modeling again on a different set, probably a smaller set. It’s kind of a hybrid workflow.”
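Here is a minimal sketch of that hybrid loop; the cheap analytic “solver” below is a stand-in we invented so the example runs anywhere, not AMD’s actual workflow:

```python
import numpy as np

def expensive_solver(x: np.ndarray) -> np.ndarray:
    """Stand-in for a costly CFD run: some smooth nonlinear response."""
    return np.sin(3 * x) + 0.5 * x**2

# 1) Traditional HPC: generate ground-truth data on a modest sample.
x_train = np.linspace(0.0, 2.0, 50)
y_train = expensive_solver(x_train)

# 2) Train a surrogate on that data (here, a polynomial least-squares fit).
coeffs = np.polyfit(x_train, y_train, deg=8)

# 3) Infer on many new points at a tiny fraction of the solver's cost.
x_new = np.linspace(0.0, 2.0, 1000)
y_pred = np.polyval(coeffs, x_new)
print(f"max surrogate error: {np.abs(y_pred - expensive_solver(x_new)).max():.4f}")
```

If the surrogate’s error is too large in some region, that region becomes the “different, probably smaller set” Su describes: rerun the real solver there and retrain.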
It’s early days, and work needs to be done to find the right algorithms and determine how to attack these problems, but it would mean bringing more algorithmic thinking to system-level optimization. If the industry wants to improve energy efficiency enough to make zettascale computing a reality, she said, these are the things that will need to be done.