Relief for the Solution Architect: Pushing Back on HPC Cluster Complexity with Warewulf and Apptainer – High-Performance Computing News Analysis

At heart and by training you are a research scientist, financial analyst, or product design engineer performing multiphysics CAE. So how did you end up as… a sysadmin? You wanted to be one thing and you became something else entirely. You finished school and started working with some heavy-duty HPC-class clusters. One day there’s a system problem and you, poor soul, step up and build a solution. Someone, probably an older person, compliments you: “Wow, that’s impressive, man. I could never have figured that out myself,” or something like that.

Word gets around and it doesn’t take long before you’re the go-to person if something goes wrong with the cluster, which it often does. Soon you’re sitting in front of a series of screens monitoring the system while everyone else is doing science, balancing hedge fund portfolios, or simulating cool new product designs. And you may be wondering, “Well, how did I get here?”**

Organizations that rely on clusters, be they 100 nodes or 1,000, would be nowhere without system administrators, also known as solution architects. It’s mind-bending, painstaking work that lacks the glamour of actually using the clusters. But everyone from the CEO on down knows that without good solution architects, their organizations would grind to a halt.

And there are far from enough of them. Clusters are larger, more complicated, more powerful, and more heterogeneous than ever before, and they’re becoming increasingly difficult to manage as they take on larger and more complex tasks.

“You don’t start out thinking, ‘I’m going to get into cluster system administration,’” said Glen Otero, Ph.D., Director of Scientific Computing, Genomics AI and Machine Learning at CIQ, a technology company with expertise in HPC-class clusters. “You start out as someone who is going to do something big in science. But you end up in this room because – we joke about it – you voluntarily built the system. And then when you do it, it’s like, ‘Hey, can you do that too? Can you do that too?’ And then one day you wake up and you’re like, ‘Where has my life gone? I should do some research.’”

CIQ at SC22

For as long as clusters have existed, provisioning and managing them has called for tools that smooth out and at least partially automate these processes. Three prominent open source projects have taken on cluster complexity, and all three are the brainchild of Greg Kurtzer, the founder and CEO of CIQ. The three projects are:

– The Rocky Linux operating system, based on the CentOS Linux distribution that Kurtzer launched and for which Red Hat announced in December 2020 that it was withdrawing support (see related InsideHPC story). It is widely used by organizations that need to build large, complex HPC-class clusters.

– Warewulf, a cluster deployment solution developed by Kurtzer starting in 2001 when he was running Linux clusters at Lawrence Berkeley National Laboratory for the Department of Energy.

– Apptainer, also developed by Kurtzer at Berkeley Lab, a secure, high-performance container system for applications that began as “Singularity,” an HPC-tailored answer to Docker.

Kurtzer created CIQ to provide support, services, tools and other value for Rocky Linux, Warewulf and Apptainer, and he is a driving force behind the open source communities that contribute to the three projects. CIQ provides traditional HPC-related solutions and support, and is behind HPC-2.0, a computing paradigm leading the way to cloud-native, hybrid, federated computing (discussed in a later article on this site).

Gregory Kurtzer

“Building and running clusters is difficult, there is no way around it,” said Brock Taylor, vice president of high performance computing and strategic partners at CIQ. “A cluster consists of thousands of components. If you add all the hardware and software together, the operating system alone contains a lot of things. It takes a lot of effort to get there, a lot of expertise.”

When Beowulf clusters started in the mid-1990s, deployment was scripted, hands-on, and do-it-yourself. Open source tools such as OSCAR, Rocks and Warewulf soon became available.

“So you have these deployment systems that help simplify the deployment of clusters,” Taylor said, “but over time the complexity increases. It’s like entropy, right? With clusters, it never gets easier, it gets harder. The complexity always precedes the solution.”

Commercial software offerings also emerged, such as the one from Platform Computing, which was largely based on Rocks (the company was later acquired by IBM), and the one from Bright Computing, which NVIDIA added to its enterprise stack last January.

For proponents of the open-source movement, it’s beneficial that Warewulf and Apptainer remain community-supported and vendor-neutral. They are not panaceas, however: cluster entropy always remains, and there are not enough solution architects to meet the demand, especially ones who can successfully wade into the HPC cluster alligator pit.

“It’s a big problem with HPC,” Taylor said. “It’s a shrinking pool: finding people who can keep on top of all the technology, and keeping them. And as they gain more experience managing HPC systems, their price can increase and they have many opportunities to go elsewhere.”

Warewulf helps with part of cluster management, simplifying the process of adding new cluster nodes through the use of “images,” where, as Taylor put it, “all the magic happens.” An image contains a complete software stack, a “golden snapshot” of everything a node runs: the software that powers computing, storage, networking and the rest. Images allow new nodes to be added as exact copies of the nodes they will be working with, and ensure that all “piping and cables are connected correctly and consistently,” Taylor said, “which is quite a difficult task.”

In shops that run Rocky Linux, Warewulf and Apptainer, Warewulf images are delivered as containers to spin up compute nodes in the cluster. Images can also capture variations among cluster nodes (for example, one node with GPUs and CPUs while the other nodes have only CPUs), and those nodes still function as part of the cluster.
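The image idea described above can be illustrated with a small conceptual sketch in Python. It is not Warewulf code and calls no Warewulf API (a real Warewulf cluster is managed through its own `wwctl` tooling); the `Image` and `Node` classes, the function names, and the image and package names are all hypothetical, chosen only to show how every node of a given hardware class ends up on one identical golden image while GPU nodes get a variant of it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Image:
    """A named, immutable software stack (hypothetical model, not a Warewulf type)."""
    name: str
    packages: frozenset


@dataclass
class Node:
    """A compute node; `image` is None until the node has been provisioned."""
    hostname: str
    has_gpu: bool = False
    image: Optional[Image] = None


def assign_images(nodes: List[Node], base: Image, gpu: Image) -> None:
    """Give every node the image that matches its hardware class."""
    for node in nodes:
        node.image = gpu if node.has_gpu else base


def images_in_use(nodes: List[Node]) -> Dict[bool, Set[str]]:
    """Map each hardware class (has_gpu) to the image names deployed on it."""
    in_use: Dict[bool, Set[str]] = {}
    for node in nodes:
        name = node.image.name if node.image else "<unprovisioned>"
        in_use.setdefault(node.has_gpu, set()).add(name)
    return in_use


def is_consistent(nodes: List[Node]) -> bool:
    """True when each hardware class runs exactly one image and no node is bare."""
    return all(
        len(names) == 1 and "<unprovisioned>" not in names
        for names in images_in_use(nodes).values()
    )


# Illustrative names only: a base image and a GPU variant built on top of it.
base = Image("rocky-base", frozenset({"slurm", "apptainer"}))
gpu = Image("rocky-gpu", base.packages | {"gpu-driver"})

cluster = [Node("n001"), Node("n002"), Node("g001", has_gpu=True)]
assign_images(cluster, base, gpu)
```

Adding a node in this model means appending a `Node` and running `assign_images` again; because images are immutable, the new node is by construction an exact copy of its peers rather than a hand-configured machine that can drift.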

Jonathan Anderson, CIQ’s lead HPC solution architect, explains why Apptainer and Warewulf make a powerful combination.

“Apptainer brings scientific computing end users into the container ecosystem and gives them full control over the operating environment in which their applications run,” he said. “Warewulf 4 brings cluster administrators into the same container ecosystem by basing compute node images on standard OS containers. By bringing users and admins together in the same ecosystem, they can work better together and build on each other’s work.”

This is where CIQ can play an invaluable role in HPC shops. The company not only has expertise at the base level of the operating system, but also with Warewulf and Apptainer.

“Warewulf helps you keep your compute node software consistent while all your individual users run different applications, ‘snowflake’ applications, in containers,” Otero said. Combining the three (Rocky, Apptainer, Warewulf) into an integrated whole means organizations can quickly and easily build and grow clusters at scale.

“Applications run in containers, and because they’re platform-agnostic – because everything is packaged in a container – it allows the administrator to manage these nodes as if they were all the same,” Otero said. “Snowflake applications are emerging, some nodes have GPUs for example, and the admin might want to use Warewulf to create a slightly different Linux image that works on those nodes. Warewulf allows them to move that container with the GPUs onto the node, and then Warewulf can just as easily restore that node back to its previous state.”

Node flexibility, scalability, deployment and expansion of clusters, facilitation of system administration tasks – all this is within reach for organizations that rely on HPC clusters to get their work done.

And who knows, maybe some of those researchers, analysts, and designers-turned-sysadmins can spend more time doing what they were meant to do in the first place.

** Talking Heads, “Once in a Lifetime”