If you have worked with Peripheral Component Interconnect Express (PCIe) in the past, you might have heard that Compute Express Link (CXL) is a breakthrough technology for modern compute requirements driven by high-performance computing, cloud, AI, and ML. The CXL buzz is real and is resonating with the big industry players across the processing and storage landscape. We are already seeing pre-production CXL design demos claiming interoperability with Intel Xeon Sapphire Rapids processors. One of the major enhancements with CXL is the ability to add more system memory that is cacheable across the various devices connected on the CXL link. This feeds the ever-increasing demand for more processor memory and higher computation speed by utilizing powerful accelerators. Let's take a closer look at CXL.cache semantics in this blog.
CXL.cache was developed to allow Devices to access and cache Host-attached memory, which was not possible with the traditional load-store format of PCIe.
The typical workflow between the CXL Host (aka CPU) and the CXL Device (aka accelerator) includes the following steps (a simple sketch of this flow follows the list):
- Host loads the Device's I/O-mapped memory for work submission
- Device independently performs the task
- Device notifies the processor on completion of the task and returns data (if required)
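The Python sketch below models this traditional non-coherent flow conceptually. The class and method names (PcieDevice, write_mmio, host_submit_work, and the register offsets) are invented for illustration and do not correspond to any real driver or VIP API.

```python
# Minimal sketch of the traditional non-coherent PCIe work-submission flow.
# All names and offsets here are illustrative only.

class PcieDevice:
    """Models an accelerator reachable only through MMIO reads/writes."""
    def __init__(self):
        self.mmio = {}          # doorbell / descriptor registers
        self.completion = None

    def write_mmio(self, offset, value):
        self.mmio[offset] = value

    def run(self):
        # The device works out of its own local memory; every extra trip
        # to Host memory costs a full round trip over the link.
        descriptor = self.mmio.get(0x0)
        self.completion = f"done({descriptor})"

    def read_mmio(self, offset):
        return self.completion if offset == 0x8 else self.mmio.get(offset)


def host_submit_work(device, work_descriptor):
    device.write_mmio(0x0, work_descriptor)   # 1. load work via MMIO
    device.run()                              # 2. device executes independently
    return device.read_mmio(0x8)              # 3. host collects the completion

print(host_submit_work(PcieDevice(), "checksum buffer@0x1000"))
```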
This works well over a traditional non-coherent PCIe I/O link where little interaction is required between Host and Device, but it slows down the system if the Device needs to go back and forth to Host memory for data. With CXL.cache, an accelerator can fetch the required data from the Host, save it in the Device cache, perform the task, and update the Host on work completion. This enables a close working relationship between Host and Device and gives a purpose-built accelerator the opportunity to significantly improve the system's overall performance.
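To contrast with the previous sketch, here is a hedged, conceptual model of a Type1-style accelerator that caches Host memory locally. HOST_MEMORY, CachingAccelerator, and the 64-byte line addresses are assumptions made for the example, not a real CXL interface.

```python
# Illustrative sketch: an accelerator that pulls Host cachelines into a local
# cache, works on them, and reports completion. Repeated accesses hit locally
# instead of crossing the link each time.

HOST_MEMORY = {0x1000 + 64 * i: i for i in range(4)}   # Host-attached memory (64B lines)

class CachingAccelerator:
    def __init__(self):
        self.cache = {}     # device-side cache of Host cachelines

    def fetch(self, addr):
        # With CXL.cache the device may cache Host memory instead of
        # re-reading it over the link on every access.
        if addr not in self.cache:
            self.cache[addr] = HOST_MEMORY[addr]
        return self.cache[addr]

    def run(self, addrs):
        result = sum(self.fetch(a) for a in addrs)      # second pass hits the cache
        return {"status": "complete", "result": result}

print(CachingAccelerator().run(list(HOST_MEMORY) * 2))
```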
CXL.cache uses the MESI (Modified, Exclusive, Shared, Invalid) coherence protocol. By design, the link is asymmetric, which means the Home Agent (which resolves cache coherency) is implemented only in the CXL Host. This keeps the CXL Device's coherency agent lightweight. Let's take an example CXL.cache scenario where the Device requests the Host for ownership of a cacheline that is currently in the Modified state.
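The toy model below sketches that scenario: the Device asks the Host's Home Agent for ownership of a cacheline the Host holds in Modified state, the Home Agent writes the dirty copy back and invalidates it, and the Device receives the line with ownership. This is a conceptual MESI illustration, not the exact CXL.cache request/response message set; class names and addresses are made up for the example.

```python
# Toy MESI model of a Device requesting ownership of a Modified cacheline.

from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

class HomeAgent:
    """Lives in the CXL Host; resolves coherency for Host-attached memory."""
    def __init__(self):
        self.memory = {0x40: b"stale"}
        self.host_cache = {0x40: (State.MODIFIED, b"dirty")}   # Host holds the line Modified

    def read_for_ownership(self, addr):
        state, data = self.host_cache.get(addr, (State.INVALID, None))
        if state is State.MODIFIED:
            self.memory[addr] = data                    # write back the dirty copy
        self.host_cache[addr] = (State.INVALID, None)   # snoop-invalidate the Host copy
        return self.memory[addr]                        # grant the line to the requester

class DeviceCache:
    """Lightweight coherency agent on the CXL Device side."""
    def __init__(self, home_agent):
        self.home = home_agent
        self.lines = {}

    def request_ownership(self, addr):
        data = self.home.read_for_ownership(addr)
        self.lines[addr] = (State.EXCLUSIVE, data)      # Device may now modify locally
        return self.lines[addr]

device = DeviceCache(HomeAgent())
print(device.request_ownership(0x40))   # Device ends up owning the written-back data
```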
In the above simple operation, you can see how the CXL link maintains coherence for Type1 devices, where only Host memory is cacheable by both Host and Device. For a typical workload, many such operations need to be performed across each 4KB page. The flows become more complex, with multiple request/response transactions per operation, when Device memory is also visible to the system and is cacheable by both the CXL Host and Device; these devices are called Type2 CXL devices. Another verification aspect is performance/latency: in principle, caching should deliver better performance and lower latency, but the industry needs to watch how CXL designs actually behave once they hit the market.
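As a rough back-of-the-envelope sketch of why per-cacheline traffic adds up, a 4KB page spans 64 cachelines of 64 bytes each, so caching one page of Host memory implies on the order of 64 request/response exchanges. The helper below is only illustrative arithmetic.

```python
# Counting cacheline-granular requests needed to cover one 4KB page.
PAGE_SIZE = 4096
CACHELINE = 64

def cachelines_in_page(page_base):
    return [page_base + off for off in range(0, PAGE_SIZE, CACHELINE)]

print(len(cachelines_in_page(0x10000)))   # 64 ownership requests for a single page
```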
In parallel, the CXL Consortium is actively working on bringing in more use cases from member companies and enhancing future revisions to facilitate tighter collaboration and interoperability among designs. Robust verification is imperative, not only to ensure compliance with CXL.cache semantics but also to meet and beat the desired performance metrics for heterogeneous computing. Cadence CXL Verification IP, built on top of proven PCIe Gen5 Verification IP, is well placed to help accelerate your verification cycle. In addition, Cadence System Verification IP (System VIP) is designed to provide scoreboarding and performance insights from a system perspective, with CXL, DDR, HBM, AMBA IPs, etc. connected in the fabric, making it a helpful debug aid. Stay tuned for an upcoming blog on system challenges with CXL and how System VIP can enable quick debug for our customers.
For more information on CXL VIP and System VIP, check out VIP for CXL, Triplecheck for CXL, and System VIP.