Shrinking IC geometries and higher levels of integration continue to pack SoCs with more functionality. Much of this functionality is implemented in software running on systems with ever more complex architectures. This creates validation problems that go beyond RTL verification: software functionality and performance must also be validated prior to tapeout. Failure to validate the software in the system context can result in costly surprises during post-silicon validation. Unfortunately, traditional RTL simulation tools are not fast enough for meaningful software validation. Hardware-based emulation and prototyping platforms, however, can fill the void.
Engineers at Cadence often deploy these platforms as part of a strategy to enable software validation and performance analysis prior to tapeout. These strategies are essential to achieving first-pass success. This article describes one such experience, in which an FPGA-based prototyping platform was used to validate the execution and performance of software running on a customer SoC, pre-silicon.
The Cadence Protium S1 Xilinx FPGA-based prototyping platform was used in this case study.
Customer Use Case
The subject of this case study is a customer whose SoC consisted of multiple DSP processors, a CPU, 5 Mbytes of SRAM, a DMA controller, and several other peripherals, all tied together with a crossbar interconnect (see Figure 1). A highly parallelized real-time signal processing application was to run concurrently on the processors. The application operated on data stored in the processors' DRAMs (local data RAMs). In the final system implementation, this data would stream from external FIFOs into each DSP via specialized queue input ports. We did not use this interface and instead preloaded the data into the DRAMs prior to each test run. This simplified the hardware effort while still supporting our customer's goal, which was to validate the software from a functional, architectural, and performance standpoint, prior to silicon.
Figure 1: SoC Block Diagram
Using a prototyping platform for this purpose allowed us to:
- Perform cycle-accurate pre-silicon performance analysis and architectural optimization, and
- Conduct early software development and debug with a high level of visibility.
Our goal at Cadence was to support this effort by automating the prototyping platform tool flows and creating a simplified environment for efficiently running tests, analyzing results, and debugging failures.
The Prototyping Platform Environment
The prototyping platform used for this project has some basic capabilities that provide a foundation for the scripting environment described in the next sections.
First, it has a set of tools that perform design synthesis, I/O assignment, partitioning, place and route, and generation of the design bit files that load directly onto the prototyping machine. Some of these tools are provided by the prototyping platform vendor, and some by the FPGA vendor.
The prototyping machine itself can run in two modes: interactive or automated. As discussed in more detail below, the automated mode is used for running sequences of tests unattended. The interactive mode is used for debugging sessions.
Both modes provide commands that can, among other things:
- Loading the design bit files,
- Resetting the design,
- Running and stopping clocks,
- Forcing and monitoring signals,
- Loading and dumping memories.
In interactive mode, these commands can be entered manually, which is useful during debug sessions. In automated mode, they can be bundled into a Tcl script, as illustrated below.
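As a rough illustration, the fragment below shows how a handful of these operations might be bundled in automated mode. The command names (load_design, run_clock, and so on) are placeholders invented for this sketch, not the platform's actual Tcl commands, which are defined by the vendor documentation.

# Hypothetical automated-mode fragment. All command names are placeholders,
# not the prototyping platform's real Tcl API.
load_design  ./bitfiles            ;# load the design bit files onto the FPGAs
reset_design                       ;# assert and release reset on the DUT
force_signal top.start_i 1         ;# force a signal to kick off execution
run_clock    -cycles 1000000       ;# run the main clock for a fixed cycle count
dump_memory  top.sram0 results.mif ;# dump a memory for post-analysis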
Displaying Waveforms
The signal monitoring command mentioned above samples the value of a signal at the time the command is issued, but it does not produce a waveform display. To enable waveforms, an integrated logic analyzer (ILA) core must be instantiated with the design on the FPGA. During the compile step, the ILA is connected to the desired signals, and triggering conditions are established. While not directly incorporated into the scripting environment, waveform displays proved useful during the design bring-up phase, when the basic operation of the SoC was being debugged.
We used Xilinx ChipScope for this project.
Porting the Design
Because the prototyping platform is based on FPGAs, porting to it required some minor modifications to the ASIC design. First, the ASIC SRAM macros had to be replaced with FPGA-inferred RAMs. Second, the phase-locked loop implemented within the ASIC had to be removed from the design so that the main clock could be supplied externally. These changes were relatively straightforward to implement.
The Scripting Environment
As part of the software validation process, a series of tests had been written by our customer to exercise various aspects of the SoC. Each test was independent of the others. A goal of the scripting environment was to provide a simple way for our customer to set up and run these tests on the design, in sequence, unattended. Typically, they would run overnight. Each test would:
- Issue reset,
- Load instruction and data memories (block RAMs),
- Run the software, and
- Retrieve the results (stored in SRAM) for later analysis.
In addition, the scripting environment had to support JTAG software debug sessions. During these sessions, registers in the DSPs and/or CPU could be examined or modified, software and hardware breakpoints could be set, and software variables could be displayed or changed. This use mode would usually be invoked when individual tests exposed functional or performance behavior that required more scrutiny. To enable JTAG software debug, an add-on prototyping platform daughter card is required, along with a third-party JTAG probe and software tools. This is discussed in more detail later.
To drive the scripts, a simple text file format was defined which consisted of an ordered list of tests to be run, along with their defining characteristics. For each test, the test file contained a set of comma-separated parameters, the most important of which were:
- An identifying name for the test,
- An identification of the software to be run on each DSP and the CPU,
- The number of clock cycles to run.
Figure 2 shows a sample test list file.
Figure 2: Sample Test List File
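As a sketch of how such a file can be consumed, the Tcl procedure below reads a test list in the comma-separated form described above. The column order (test name, one ELF per DSP, one for the CPU, then a cycle count) and the assumption of two DSPs are illustrative only; the actual file may carry additional fields.

# Read a comma-separated test list into a list of per-test records.
# Column order and the two-DSP assumption are illustrative, not the
# customer's actual format.
proc readTestList {fileName} {
    set tests {}
    set fh [open $fileName r]
    while {[gets $fh line] >= 0} {
        set line [string trim $line]
        # skip blank lines and comment lines
        if {$line eq "" || [string index $line 0] eq "#"} { continue }
        lassign [split $line ","] name dsp0Elf dsp1Elf cpuElf cycles
        lappend tests [list $name $dsp0Elf $dsp1Elf $cpuElf $cycles]
    }
    close $fh
    return $tests
}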
The next sections will describe the major components of the scripting environment. Refer to the flow diagram in Figure 3 showing the interaction of the scripts.
Figure 3: Scripting Environment Tool Flow
The Build Process
Before a design can be loaded onto the prototyping machine, a set of design bit files must be created. These bit files represent the design, mapped onto FPGA cells, partitioned across several FPGAs, and instrumented with the necessary debugging hooks. The process of building these bit files contains a great deal of algorithmic complexity and bookkeeping and can be time-consuming, depending on the size of the design.
There are three scripted steps that implement the build process:
- Synthesis: As input, this takes the RTL files that comprise the design. As output, it produces a platform-targeted netlist, suitable for mapping onto the FPGAs in the prototyping machine.
- Compilation: As input, this takes (a) a list of clocks and resets, and (b) a list of signals to be forced or monitored (including those captured for waveform display, discussed earlier). In addition, it takes a list of memories that must be made accessible for loading and/or dumping. The output of this step is a Makefile, which drives the place and route step, described next.
- Place and Route: This step is driven by the Makefile. It automatically partitions the netlist across the requisite number of FPGA chips on the prototyping machine, performs place and route for each FPGA, and runs timing analysis to determine the clock speed.
The result is a set of design bit files that can be loaded directly onto the prototyping machine. The build process can be time-consuming, but once it completes, the resulting bit files can be reused across multiple tests and debug sessions. It is not necessary to regenerate the bit files unless the design changes or additional visibility is required for debugging.
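A small wrapper script can sequence the three steps so that the whole build runs unattended. In the sketch below, synth_tool and compile_tool are placeholder names standing in for the vendor-supplied programs, and the file names are invented; only the Makefile-driven place and route step follows directly from the flow described above.

# Hypothetical build driver. synth_tool, compile_tool, and all file names
# are placeholders, not the actual vendor executables or inputs.
exec synth_tool -f rtl_files.f -o design.netlist >@ stdout 2>@ stderr
exec compile_tool -clocks clocks.lst -probes probes.lst -mems mems.lst \
    -netlist design.netlist -o Makefile >@ stdout 2>@ stderr
# The generated Makefile partitions the netlist, then runs place and route
# and timing analysis for each FPGA.
exec make >@ stdout 2>@ stderr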
Translating Memory Contents (memSplit)
For each test in the test list, software and data files are specified that must be loaded into the memories. These files are encoded in the standard ELF (Executable and Linkable Format). Before a test could be run, each ELF file had to be translated into multiple memory initialization (MIF) files, one for each block RAM in the memory. This step was necessary because each logical memory space in the design was subdivided into multiple physical memory blocks. For example, a 32KB instruction RAM in the design might have been implemented as four 8KB memory blocks. The ELF file used to initialize this IRAM would have to be split into four MIF files, one for each memory block in the IRAM.
This task is performed by the memSplit script, shown graphically in Figure 4.
Figure 4: memSplit
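The sketch below captures the splitting step. It assumes the ELF has already been flattened into an ASCII image with one hex word per line, that the physical blocks cover sequential address ranges, and that the MIF syntax is simplified to one word per line; the real memSplit script worked from the ELF file directly, and the file naming here is invented for illustration.

# memSplit sketch: split a flat ASCII hex image (one word per line) into
# one MIF file per physical memory block. The flat-image input, sequential
# block mapping, simplified MIF syntax, and file naming are assumptions.
proc memSplit {imageFile nBlocks prefix} {
    set fh [open $imageFile r]
    set words [split [string trim [read $fh]] "\n"]
    close $fh
    # words per physical block, rounded up
    set perBlock [expr {int(ceil(double([llength $words]) / $nBlocks))}]
    for {set b 0} {$b < $nBlocks} {incr b} {
        set first [expr {$b * $perBlock}]
        set last  [expr {$first + $perBlock - 1}]
        set out [open [format "%s_blk%d.mif" $prefix $b] w]
        foreach w [lrange $words $first $last] { puts $out $w }
        close $out
    }
}

# Example: split a 32KB IRAM image into four per-block MIF files.
memSplit iram0.hex 4 iram0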
Running a Sequence of Tests in Batch Mode (ptmRunCreate)
Once the MIF files have been created, the next task is to create a sequence of commands that instruct the prototyping machine to run the tests specified in the test list file. A script called ptmRunCreate was written to perform this task. Given the test list, it produces a Tcl script with commands that (see the sketch after this list):
- Unlock and reserve a prototyping machine on the network, ensuring that no one else is using it,
- Download the design bit files to the FPGAs on that machine,
- Load the MIF files into their appropriate memories,
- Reset the DUT,
- Force values on certain signals to enable startup,
- Run for a specified number of cycles,
- Dump memories for post-analysis,
- Repeat steps 3 through 7 (load memories through dump memories) for each test in the list,
- Exit and release the prototyping machine.
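The sketch below shows the shape of such a generator, reusing the hypothetical readTestList procedure and the placeholder platform commands from the earlier sketches; the instance paths, file naming, and command names are all invented for illustration.

# ptmRunCreate sketch: emit a Tcl command file for the prototyping machine
# from a parsed test list. Command names, instance paths, and file naming
# are placeholders, not the platform's actual Tcl API.
proc ptmRunCreate {tests outFile} {
    set out [open $outFile w]
    puts $out "reserve_machine"           ;# lock the machine for exclusive use
    puts $out "load_design ./bitfiles"    ;# download the design bit files
    foreach t $tests {
        lassign $t name dsp0Elf dsp1Elf cpuElf cycles
        foreach core {dsp0 dsp1 cpu} elf [list $dsp0Elf $dsp1Elf $cpuElf] {
            set base [file rootname [file tail $elf]]
            puts $out "load_memory top.$core.iram ${base}_blk*.mif"
        }
        puts $out "reset_design"
        puts $out "force_signal top.start_i 1"
        puts $out "run_clock -cycles $cycles"
        puts $out "dump_memory top.sram0 ${name}_results.mif"
    }
    puts $out "release_machine"           ;# free the machine when done
    close $out
}

# Example: generate an overnight batch run from the test list file.
ptmRunCreate [readTestList tests.lst] overnight_run.tcl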
Once the command file was created, it was run unattended on the prototyping machine, typically overnight. A completed test list run can produce hundreds of dumped MIF files, which contain the ending state of each test. These dump files are then analyzed to determine the status of each test.
Translating Memory Dumps into a Format Suitable for Analysis (memCombine)
A script called memCombine was written to consolidate these separate MIF files into a single ASCII HEX file for each logical memory space in the design. The HEX file was then used for post-analysis. A graphical depiction of memCombine is shown in Figure 5.
Figure 5: Translating multiple MIF files into a single HEX file.
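A minimal sketch of the consolidation step follows. It assumes the same simplified per-block file naming and sequential block ordering used in the memSplit sketch, both of which are illustrative assumptions.

# memCombine sketch: concatenate per-block MIF dump files back into a single
# ASCII hex image for one logical memory space. File naming and sequential
# block ordering are assumptions carried over from the memSplit sketch.
proc memCombine {prefix nBlocks outFile} {
    set out [open $outFile w]
    for {set b 0} {$b < $nBlocks} {incr b} {
        set fh [open [format "%s_blk%d.mif" $prefix $b] r]
        puts $out [string trimright [read $fh] "\n"]   ;# append this block's words
        close $fh
    }
    close $out
}

# Example: rebuild a dumped SRAM image from four per-block MIF files.
memCombine sram0_dump 4 sram0_results.hex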
JTAG Debugging Sessions (ptmDebugCreate)
In addition to batch test runs, our customer needed the ability to invoke interactive JTAG software debug sessions on a specified test. This would allow them to diagnose the test thoroughly in a standard software debugging environment – setting breakpoints, examining registers, and stepping through the software execution, for example.
To enable JTAG debug, a separate add-on card is required. This card attaches to the daughter card port on top of the prototyping machine. The GPIO pins on this card were routed to the JTAG debug port on the SoC, as specified in the compile script. A third-party debugger probe was attached to those pins. This, along with additional third-party software, allowed our customer to peer into the inner architecture of the DSPs during a software debug session.
We used the SEGGER J-Link Debug Probe with the Xtensa Xplorer software.
A script called ptmDebugCreate was written to set up these debug sessions. Unlike ptmRunCreate, ptmDebugCreate generates Tcl commands to run a single software executable, as opposed to a sequence of executables. Specifically, the Tcl commands perform the following steps:
- Unlock and reserve a prototyping machine on the network, ensuring that no one else is using it,
- Download the design bit files to the FPGAs on that machine,
- Load the MIF files into their appropriate memories,
- Reset the DUT,
- Force values on certain signals to enable startup,
- Run the prototyping machine clock continuously.
Once the Tcl script is run, a user can start a JTAG debug session with the third-party debugger probe and software.
Strategies Employed for the Successful Completion of this Project
Several strategies were instrumental in the successful completion of this project.
Plan for Long Compile Times
Probably the most important strategies stem from the long compile times of the build step. These long compile times demand careful planning on several fronts.
First, to handle the load effectively, the capabilities of the compute server must meet the resource requirements dictated by the size of the design. In this case, the design required a machine with several hundred gigabytes of RAM and a high-speed local solid-state drive. In our experience, network drives did not provide acceptable performance. Even with this optimized server configuration, building the design bit files took anywhere from 15 to 36 hours.
Second, given that signals to be forced and monitored are specified during the compile step, it is particularly important to determine in advance which signals need to be observed for a successful debug session. Adding new signals to the force/monitor/waveform list requires re-creating bit files, which can cost a full day on the schedule. Too many compilation iterations can significantly slow down debug.
Keep FPGA Utilization Under 70%
When FPGA utilization rises above a threshold, place-and-route run times can grow significantly, and in extreme cases the tools may fail to complete. For this project, the threshold was about 70% utilization.
There are a few things that can be done to control utilization. Making signals accessible for forcing and monitoring increases utilization. Making memories available for loading and dumping is even more impactful, increasing utilization at a much higher rate than signals. Cutting down on accessible internal components, especially memories, can help keep utilization below the threshold.
Watch Software/Hardware Version Compatibility
While it may seem obvious, it is important to ensure that software and hardware versions are compatible. Tools from multiple vendors are used in the validation flow, including the prototyping platform vendor, the FPGA vendor, and the third-party JTAG debug tools. Mismatches can result in subtle errors that can be difficult to trace back to version incompatibility. Carefully tracking versions can avoid this problem.
Streaming Data from External Sources Can Be Slow
There are several options for streaming data into the design from an external source. The prototyping platform (in this case, the Protium S1) has a PCIe bus available for this purpose, as well as an SoC I/O daughter card for simple DUT I/O. For this project, we chose to preload memories with the stream data because of its relative simplicity and high performance. On the downside, this approach pushed our FPGA utilization up, with the attendant problems discussed above.
Conclusions
The custom scripting tools developed for our customer created a straightforward, automated environment that simplified the use of the prototyping platform. In addition, the debug capabilities of the platform, in combination with the third-party JTAG debugging tools, enabled a rich and efficient debug environment capable of diagnosing both simple and complex problems. Last but not least, we achieved a 6.67 MHz clock speed, which is significantly faster than typical emulation speeds. This allowed us to run many more test cases than would otherwise have been possible and provided an efficient, cycle-accurate platform for pre-silicon performance analysis and optimization.
The result was a successful validation of software execution and system performance ahead of ASIC tapeout.