CXL is emerging as the industry focal point for coherent I/O, with OpenCAPI and Gen-Z having transferred their specifications and assets to the CXL Consortium. In August, the next full version of the standard, CXL 3.0, was announced. With the continued proliferation of cloud computing, AI, and analytics, there is an increasing need for system-level optimization among high-performance accelerators, system memory, smart NICs, and leading-edge networking. The new version of the standard introduces memory-centric fabric architectures and expanded capabilities for improving scale and optimizing resource utilization, which could change how some of the world's largest data centers and fastest supercomputers are built.
Double the bandwidth and zero added latency
CXL 3.0 builds on top of PCIe 6.0 and inherits its full bandwidth improvements, along with Pulse Amplitude Modulation 4-level (PAM4) signaling and Forward Error Correction (FEC), doubling the raw data rate to 64 GT/s. Notably, CXL 3.0 goes one step further in reducing latency, so that despite the added FEC it has the same latency as CXL 1.x and CXL 2.0.
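To put the headline numbers in perspective, here is a quick back-of-the-envelope calculation (illustrative only, ignoring FLIT, CRC, and FEC overheads) of raw per-direction bandwidth at the CXL 2.0 and CXL 3.0 transfer rates:

```python
# Illustrative calculation of raw (pre-overhead) CXL link bandwidth.
# Real payload throughput is somewhat lower once protocol overheads are counted.

def raw_bandwidth_gbs(transfer_rate_gt_s: float, lanes: int) -> float:
    """Raw bandwidth per direction in GB/s: each transfer carries 1 bit per lane."""
    return transfer_rate_gt_s * lanes / 8  # bits -> bytes

# CXL 2.0 on PCIe 5.0: 32 GT/s; CXL 3.0 on PCIe 6.0 with PAM4: 64 GT/s
for rate in (32, 64):
    print(f"x16 link at {rate} GT/s ~= {raw_bandwidth_gbs(rate, 16):.0f} GB/s per direction")
# x16 link at 32 GT/s ~= 64 GB/s per direction
# x16 link at 64 GT/s ~= 128 GB/s per direction
```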
CXL 3.0 increases the 68-byte FLIT of CXL 1.x/2.0 to 256 bytes. The standard CXL 3.0 FLIT closely follows the PCIe 6.0 FLIT layout, with a 2-byte FLIT header that indicates the protocol stack (CXL.io or CXL.cachemem). The larger 256-byte FLIT, with more bits available in the FLIT header, is one of the critical communication changes that enables the complex topologies and fabrics of the CXL 3.0 standard.
CXL 3.0 also offers a Latency-Optimized (LOpt) FLIT mode that breaks the Cyclic Redundancy Check (CRC) into 128-byte half-FLIT granular transfers to mitigate store-and-forward overheads in the physical layer. A 6-byte CRC independently protects each half, while the FEC spans the whole 256 bytes, so the receiver can consume a 128-byte half with a good CRC at much lower latency. Note that a link cannot switch back and forth between the standard and LOpt FLIT modes.
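The following is a minimal sketch, using the sizes quoted above, of why the LOpt FLIT helps: the receiver only has to buffer half as many bytes before a CRC check lets it hand payload to the link layer. The timing model is purely illustrative and not taken from the specification.

```python
# Conceptual sketch of the store-and-forward saving from the LOpt FLIT.
# Sizes follow the text above (256-byte FLIT, 128-byte halves, 6-byte CRC
# per half); everything else is illustrative, not from the spec.

FLIT_SIZE = 256   # bytes in a standard CXL 3.0 FLIT
HALF_FLIT = 128   # bytes in an LOpt half-FLIT, each protected by its own CRC

def bytes_buffered_before_consume(lopt: bool) -> int:
    """How many bytes the receiver must accumulate before a CRC check
    allows it to start consuming payload."""
    return HALF_FLIT if lopt else FLIT_SIZE

print("standard FLIT:", bytes_buffered_before_consume(lopt=False), "bytes")  # 256
print("LOpt FLIT:    ", bytes_buffered_before_consume(lopt=True), "bytes")   # 128
```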
New protocols (UIO and BI) for enhanced coherency and peer-to-peer communication
CXL 3.0 enables non-tree topologies and peer-to-peer (P2P) communication within a virtual hierarchy of devices that share a coherency domain. Devices can communicate with each other directly rather than going through the host for every transaction, overcoming the host bottleneck of a tree topology. The two major enhancements behind this are the Unordered I/O (UIO) and Back-Invalidation (BI) protocols.
Traditionally, producer-consumer ordering semantics are enforced at every entity along the path, whether a switch, an endpoint, or a root port; all of them enforce the same ordering rules. UIO relaxes this by moving producer-consumer enforcement to the source, which avoids unnecessary traffic and enables parallel paths for better bandwidth and latency. These properties are what make peer-to-peer communication possible.
The BI protocol allows the large memory in Type 2 devices to be mapped as Host-managed Device Memory (HDM) using back-invalidation. In CXL 1.x/2.0, the bias-flip mechanism required the entire HDM region to be tracked because the device could not back-snoop the host. Back-invalidation in CXL 3.0 enables a snoop-filter implementation, so large memory can be mapped to HDM.
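To make the snoop-filter idea concrete, here is a hypothetical, highly simplified model of the bookkeeping that back-invalidation enables on the device side. The class and method names are invented for illustration and do not correspond to actual CXL interfaces; the point is only that the device tracks just the HDM lines the host actually caches and can reclaim one with a back-invalidate snoop, instead of tracking or biasing the whole HDM range as in CXL 1.x/2.0.

```python
# Hypothetical sketch of a device-side snoop filter enabled by CXL 3.0
# back-invalidation (BI). Not a real CXL implementation -- just a model
# of the bookkeeping the protocol makes possible.

class DeviceSnoopFilter:
    def __init__(self):
        # Only the HDM cache lines currently cached by the host are tracked,
        # rather than the full (potentially huge) HDM range.
        self.host_cached_lines: set[int] = set()

    def host_acquires(self, line_addr: int) -> None:
        """Host caches a line from HDM; record it in the filter."""
        self.host_cached_lines.add(line_addr)

    def device_wants_exclusive(self, line_addr: int) -> None:
        """Device needs ownership of a line. With BI it can snoop the host
        back for just that line and drop the entry once the host responds."""
        if line_addr in self.host_cached_lines:
            self.send_back_invalidate(line_addr)       # back-invalidate snoop
            self.host_cached_lines.discard(line_addr)  # host response clears it

    def send_back_invalidate(self, line_addr: int) -> None:
        print(f"back-invalidate -> host for line 0x{line_addr:x}")

sf = DeviceSnoopFilter()
sf.host_acquires(0x1000)
sf.device_wants_exclusive(0x1000)  # triggers a back-invalidate for that line only
```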
Multi-level Switching and Fabric Management
A significant addition in CXL 3.0 is multi-tiered switching and switch-based fabrics. CXL 2.0 allows for a single switching layer, with switches connecting vertically to upstream hosts and downstream devices but not to other switches, which limits scale to the ports available on a single switch. CXL 3.0 enables switch fabrics in which switches can connect to other switches, and each root port can connect to more than one device type, vastly increasing the scaling possibilities.
A CXL fabric can be built in virtually any configuration a system needs. This composability lets heterogeneous compute elements, Type 1, Type 2, or Type 3 devices, be combined into the overall system with no restrictions on architecture. Fabric-enabled systems can be dynamic, flexible, and intelligent, allowing the system to be designed around any application and resources to be right-sized instead of over-provisioned. Multi-headed and fabric-attached devices further enhance fabric management and composable disaggregated infrastructure.
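As a toy illustration of the topological difference (the node names are hypothetical, not spec terminology), a CXL 2.0 hierarchy has a single switching layer, while a CXL 3.0 fabric may cascade switches:

```python
# Toy adjacency lists contrasting a CXL 2.0 tree with a CXL 3.0 fabric.
# Node names are hypothetical; only the topology shapes matter.

cxl2_tree = {                       # single switch level, strictly vertical
    "host": ["switch0"],
    "switch0": ["host", "dev_a", "dev_b"],
    "dev_a": ["switch0"], "dev_b": ["switch0"],
}

cxl3_fabric = {                     # switches connect to switches (non-tree scaling)
    "host": ["switch0"],
    "switch0": ["host", "switch1", "dev_a"],
    "switch1": ["switch0", "dev_b", "gfam_mem"],
    "dev_a": ["switch0"], "dev_b": ["switch1"], "gfam_mem": ["switch1"],
}

def has_switch_to_switch_link(graph) -> bool:
    """True if any switch connects directly to another switch (multi-level switching)."""
    return any(n.startswith("switch") and m.startswith("switch")
               for n, neighbors in graph.items() for m in neighbors)

print(has_switch_to_switch_link(cxl2_tree))    # False: single switching layer
print(has_switch_to_switch_link(cxl3_fabric))  # True: cascaded switches form a fabric
```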
Improved Memory Sharing and Pooling
CXL 3.0 enables Global Fabric Attached Memory (GFAM) by disaggregating memory from the processing units and implementing a large shared memory pool. The pooled memory can be of many different types, e.g., a mixture of DRAM and NAND flash, and can be accessed by multiple processors connected to the GFAM device either directly or through a CXL switch. Even if the disaggregated memory is spread around the rack, access times remain fast. A rack-scale memory fabric is a step on the journey toward realizable memory-centric computing.
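Below is a simplified, hypothetical sketch of what a GFAM-style pool makes possible: one shared pool built from mixed media, carved up on demand for multiple hosts. The API shown is invented for illustration; CXL defines the transport and device model, not this interface.

```python
# Hypothetical model of a fabric-attached memory pool mixing media types.
# Purely conceptual; none of these names correspond to actual CXL APIs.

from dataclasses import dataclass, field

@dataclass
class GfamPool:
    # remaining capacity per media type, in GB
    capacity: dict = field(default_factory=lambda: {"dram": 1024, "nand": 8192})
    allocations: list = field(default_factory=list)

    def allocate(self, host: str, size_gb: int, media: str) -> bool:
        """Carve a slice of the shared pool out for a host, if capacity remains."""
        if self.capacity.get(media, 0) >= size_gb:
            self.capacity[media] -= size_gb
            self.allocations.append((host, media, size_gb))
            return True
        return False

pool = GfamPool()
pool.allocate("host0", 256, "dram")   # hot working set in DRAM
pool.allocate("host1", 2048, "nand")  # colder capacity tier in NAND flash
print(pool.capacity)     # {'dram': 768, 'nand': 6144}
print(pool.allocations)  # [('host0', 'dram', 256), ('host1', 'nand', 2048)]
```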
Full Backward Compatibility with CXL 1.x and CXL 2.0
Finally, CXL 3.0 ensures full backward compatibility with CXL 1.x and CXL 2.0: devices and hosts can downgrade as needed to match the rest of the hardware chain, albeit losing newer features and speeds in the process.
Summary
CXL 3.0 features facilitate the move to distributed, composable architectures and higher performance levels for AI/ML and other compute-intensive or memory-intensive workloads. The CXL 3.0 protocol can support up to 4,096 nodes, extending beyond the rack. In composable server architectures, servers are broken apart into their various components and placed in groups from which these resources can be dynamically assigned to workloads on the fly. CXL technology continues to enable game-changing innovations for the modern data center at scale.
Cadence has announced its CXL 3.0 Verification IP in a press release, which you can read about on the product page, Simulation VIP for CXL.
More Information:
- For more information on how Cadence PCIe and CXL Verification IP and TripleCheck enable users to confidently verify these new disruptive changes, see our Simulation VIP for CXL, Simulation VIP for PCIe, and TripleCheck for PCIe and CXL
- For more information on CXL in general, see the CXL Consortium website