Enabling new server architectures with CXL Interconnect

The ever-growing demand for computing performance is motivating exploration of new compute offload architectures for the data center. Artificial intelligence and machine learning (AI/ML) are just one example of the increasingly complex and demanding workloads pushing data centers away from the classic server computing architecture. These workloads can benefit greatly from lower-latency, coherent memory architectures. This is where the Compute Express Link (CXL) standard comes into play.

First introduced in 2019, CXL has become an enabling technology for connecting computing resources. It provides a way to interconnect a wide range of computing elements, including CPUs, GPUs, systems-on-chip (SoCs), memory, and more, in a cache-coherent manner. This is particularly compelling in a world of heterogeneous computing, where purpose-built accelerators offload targeted workloads from the CPU. As workloads become more demanding, more memory resources are provisioned with accelerators. CXL makes it possible to share these memory resources across CPUs and accelerators for greater performance, improved efficiency, and lower total cost of ownership (TCO).

CXL adopted the ubiquitous PCIe standard for its physical layer, capitalizing on that standard’s tremendous momentum in the industry. At the time CXL was first released, PCIe 5.0 was the latest PCIe standard, and the CXL 1.0, 1.1, and 2.0 generations all use PCIe 5.0 32 GT/s signaling. CXL 3.0, released in 2022, adopted PCIe 6.0 as its physical interface and, like PCIe 6.0, uses PAM4 signaling to double the data rate to 64 GT/s.
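As a rough illustration of what the move from 32 GT/s to 64 GT/s means, the short sketch below computes raw unidirectional link bandwidth from the transfer rate and lane count. It deliberately ignores encoding and protocol overhead, and the function name is chosen here purely for illustration:

```python
def raw_bandwidth_gbps(rate_gt_s: float, lanes: int) -> float:
    """Raw unidirectional link bandwidth in GB/s.

    Ignores encoding and protocol overhead: each transfer carries
    one bit per lane, so GB/s = GT/s * lanes / 8.
    """
    return rate_gt_s * lanes / 8

# PCIe 5.0 / CXL 1.x and 2.0: 32 GT/s NRZ signaling
print(raw_bandwidth_gbps(32, 16))  # 64.0 GB/s per direction on a x16 link

# PCIe 6.0 / CXL 3.0: 64 GT/s, using PAM4 to carry 2 bits per symbol
print(raw_bandwidth_gbps(64, 16))  # 128.0 GB/s per direction on a x16 link
```

Real delivered bandwidth is somewhat lower once link encoding and packet framing are accounted for, but the doubling from generation to generation holds.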

To support a variety of use cases, the CXL standard defines three protocols: CXL.io, CXL.cache, and CXL.mem. CXL.io provides a non-coherent load/store interface for I/O devices and is used for device discovery, enumeration, and register access. CXL.cache allows devices such as accelerators to efficiently access and cache host memory for improved performance. Combining CXL.io with CXL.cache enables the following usage model: an accelerator-based NIC (a Type 1 device in CXL jargon) can coherently cache host memory on the accelerator, perform networking or other functions on it, and then release ownership of the memory back to the CPU for further processing.


The combination of the CXL.io, CXL.cache, and CXL.mem protocols enables another compelling use case: it allows a host and an accelerator with attached memory (a Type 2 device) to coherently cache each other's memory resources. This provides tremendous architectural flexibility by giving processors, whether hosts or accelerators, access to greater capacity and bandwidth across their combined memory resources. One application that benefits from lower-latency coherent access to CPU-attached memory is natural language processing (NLP), whose models typically require more memory than can fit on a single accelerator board.
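The device types mentioned above map directly onto protocol combinations defined in the CXL specification. The minimal sketch below encodes that mapping (the dictionary and helper names are chosen here for illustration; Type 3, a memory-expansion device class from the same specification, is included to complete the taxonomy):

```python
# Protocol sets per CXL device type, per the CXL specification.
CXL_DEVICE_TYPES = {
    # Type 1: accelerators (e.g. smart NICs) that coherently cache host memory
    "Type 1": {"CXL.io", "CXL.cache"},
    # Type 2: accelerators with their own attached memory (e.g. AI/ML accelerators)
    "Type 2": {"CXL.io", "CXL.cache", "CXL.mem"},
    # Type 3: memory-expansion devices exposing memory to the host
    "Type 3": {"CXL.io", "CXL.mem"},
}

def supports_coherent_host_caching(device_type: str) -> bool:
    """A device can cache host memory only if it implements CXL.cache."""
    return "CXL.cache" in CXL_DEVICE_TYPES[device_type]

print(supports_coherent_host_caching("Type 1"))  # True
print(supports_coherent_host_caching("Type 3"))  # False
```

Note that every device type includes CXL.io, since discovery, enumeration, and register access are always needed; the coherency protocols are layered on top according to the device's role.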

Rambus offers a CXL 2.0 interface subsystem (controller and PHY) as well as a CXL 3.0 PHY (PCIe 6.0 PHY) that are ideal for high-performance devices such as AI/ML accelerators. These solutions draw on more than 30 years of Rambus experience in high-speed signaling, along with extensive work on PCIe and CXL designs.



Lou Ternullo is senior director of product marketing at Rambus.