We all know the benefits of gate-level simulation, but we also all know the pain of trying to run it. Setting up and running GLS on large designs is a huge time sink for engineers, but what if there were a better way? A team at Northrop Grumman faced this exact issue while working on a large ASIC. They found some clever ways to vastly speed up their gate-level simulation, and they shared their experience in a presentation at CadenceLIVE Boston 2023.
Why would they want to do GLS with such a large design, though? For one, many functional bugs can’t be found through RTL simulation alone. RTL simulation is X-optimistic, so it can’t properly test how unknown states propagate. Because it assumes ideal timing, it also struggles to expose simulation races and other timing bugs. In addition, synthesis and test insertion tools can create their own functional bugs, and you’d need to simulate at the gate level to find these. Finally, even with lint, formal, and STA tools in the flow, the rules, constraints, and waivers that govern each of them can still hide actual issues. GLS with SDF annotation provides a cross-check against the results from those tools.
Generally, though, GLS is fairly slow. While reusing RTL testbenches for GLS could speed things along, there are some challenges with doing this. RTL testbenches typically don’t consider timing, for one. Agents written for RTL don’t have adjustable timing, and tests will often perform random resets, causing false timing and functional failures.
Beyond adapting RTL testbenches for GLS, Northrop Grumman also wanted to utilize Xcelium to speed up their compilation, elaboration, and simulation times by using its multi-core functionality. It’s quite the set of prospective improvements, but the team was ready for the challenge.
First, they set out to resolve testbench timing violations. This was accomplished by shifting the agent clock, and they found two methods for doing so. The first involved balancing the delay in the agent (DLY_agt) with the delay in the clock (DLY_clk). This required creating a new DLY_agt for each area with specific timing requirements. If this wasn’t possible or desirable, they could instead tap the agent clock from the DUT’s internal clock endpoint. This worked for all timing corners, and a secondary DLY_agt could be used to fine-tune any additional timing requirements. There’s less iteration required with this second method, and it takes better advantage of the work already done by the RTL team.
Figure 1: Two methods for resolving timing violations
To add timing to RTL and behavioral models, wrappers with delays were created. SystemVerilog configurations were used to substitute the original instance with the wrapper—this allows the system to ensure clocks are aligned before anything is passed to the DUT.
Figure 2: Wrappers and the SystemVerilog code to configure them
What about the random resets? The team at Northrop Grumman handled these by using breakpoints on the reset entry and reset exit conditions. For those planning to implement similar strategies, the team recommended creating an auxiliary signal that triggers the reset exit breakpoint before the actual reset exit condition. This prevents timing violations from being masked by the actual reset de-assertion.
It was switching to multi-core simulation that gave Northrop Grumman its biggest speed improvement, though. By using -mce, -mce_disable_nocellaccess, -mce_build_cpu_configuration single-socket, -mce_build_thread_count 2, -mce_sim_cpu_configuration single-socket, and -mce_sim_thread_count 32, alongside the LSF options listed below, Northrop Grumman was able to perform full-chip compilation and elaboration with SDF annotation in only eleven hours, less than half the time it took on a single core.
Figure 3: LSF options
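For readers who script their regressions, here is a minimal sketch of how the multi-core options quoted above could be collected in one place. It only assembles a command line and does not launch anything. The helper names (mce_args, command_line) and the file list gls_files.f are hypothetical, and the simulation CPU option is written with an underscore by analogy with the build option, which is an assumption rather than something the presentation confirms. The LSF options from Figure 3 are not reproduced here.

```python
import shlex


def mce_args(build_threads=2, sim_threads=32, cpu_configuration="single-socket"):
    """Return the multi-core options quoted in the article as an argument list."""
    if build_threads < 1 or sim_threads < 1:
        raise ValueError("thread counts must be positive")
    return [
        "-mce",
        "-mce_disable_nocellaccess",
        "-mce_build_cpu_configuration", cpu_configuration,
        "-mce_build_thread_count", str(build_threads),
        # Underscore spelling assumed by analogy with the build option above.
        "-mce_sim_cpu_configuration", cpu_configuration,
        "-mce_sim_thread_count", str(sim_threads),
    ]


def command_line(tool, design_args, **mce_overrides):
    """Join the tool name, design-specific arguments, and multi-core options."""
    return shlex.join([tool, *design_args, *mce_args(**mce_overrides)])


# "gls_files.f" is a hypothetical file list standing in for the real design inputs.
print(command_line("xrun", ["-f", "gls_files.f"]))
```

Keeping the build and simulation thread counts as separate parameters mirrors the split in the options themselves: the team used 2 threads for the build and 32 for simulation, and a block-level run could pass sim_threads=8 instead.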
For block simulation with SDF annotation, run across thirty-three tests using eight threads, Xcelium allowed Northrop Grumman to reach a 4.2X reduction in runtime. In one case, a forty-six-hour test completed in only eleven hours! With some of the tests taking over a week to complete on a single core, this speedup is huge.
They also wanted to test top-level simulation in zero-delay mode with Xcelium Multi-Core. Here, the team at Northrop Grumman tried eight, sixteen, and thirty-two threads. In one thirty-two-thread test, a 20.5-hour single-core runtime was reduced to only 5.5 hours! Similarly, when using thirty-two threads in top-level simulation with SDF annotation, an eight-day test was reduced to only 2.7 days, with an average runtime reduction of 2X.
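To put the individual results quoted above on the same footing, the short sketch below computes the speedup factor and the percentage of runtime saved for each of the three named runs. The runtimes come straight from the text; the function names and labels are just for illustration.

```python
def speedup(single_core, multi_core):
    """Return how many times faster the multi-core run finished."""
    if single_core <= 0 or multi_core <= 0:
        raise ValueError("runtimes must be positive")
    return single_core / multi_core


def time_saved_pct(single_core, multi_core):
    """Return the share of the single-core runtime that was eliminated, in percent."""
    return 100.0 * (1.0 - multi_core / single_core)


# (label, single-core runtime, multi-core runtime, unit), as quoted in the article
RUNS = [
    ("block, SDF, 8 threads", 46.0, 11.0, "hours"),
    ("top, zero-delay, 32 threads", 20.5, 5.5, "hours"),
    ("top, SDF, 32 threads", 8.0, 2.7, "days"),
]

for label, before, after, unit in RUNS:
    print(
        f"{label}: {before} -> {after} {unit}, "
        f"{speedup(before, after):.2f}x faster, "
        f"{time_saved_pct(before, after):.0f}% less runtime"
    )
```

The three runs work out to roughly 4.18x, 3.73x, and 2.96x. The forty-six-hour block test lines up with the 4.2X figure, while the eight-day SDF run did better than the 2X average reported for that configuration.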
Things are looking fast over at Northrop Grumman now that Xcelium Multi-Core is in the mix. Between Xcelium’s runtime improvements over single-core and the team’s efforts to adapt and reuse RTL testbenches for GLS, gate-level simulation is no longer the daunting task it was before. Interested in how Xcelium can help you? Check out our Xcelium Multi-Core page for more information.