Have you ever fixed a bug during the design stage only to uncover more bugs while fixing it? For verification engineers, this is a common problem: previously passing tests suddenly fail, and the team is left to pick up the pieces. Ensuring correct logic functionality early in the design is the holy grail for functional verification and debug teams. To mitigate such issues and help verification teams meet their time budgets, a generational shift is needed from traditional single-run, single-engine schemes to schemes that leverage big data and generative AI across multiple runs of multiple engines throughout an entire SoC verification campaign. Cadence Verisium is one such solution: an artificial intelligence (AI)-driven platform that optimizes verification workloads, boosts coverage, and accelerates root-cause analysis of bugs. It includes a new generation of AI-driven verification apps that help reduce silicon bugs and accelerate time to market. This blog discusses how the Verisium AutoTriage app helps cluster failures that share a common root cause.
Why AutoTriage?
Triage is the part of the design/verification process in which failing test results from a regression run are prepared for analysis, categorized, and dispatched to available resources, where the real debug takes place. During the regression stage of a chip design/verification project, triage is commonly used to prioritize multiple failing tests and determine the resources they require. At this stage, a viable regression suite exists, with enough of the chip and its testbench in place to drive the team's day-to-day and week-to-week activities. Failure triage analyzes large sets of failures, groups those likely to be caused by the same design error, and then allocates the groups to the appropriate engineers for fixing. Traditional failure triage groups runs by their failure description and cannot identify whether failures stem from the same bug. Many different failure messages can be associated with the same bug, and a single failure message can be caused by many different bugs.
Regression failure triage is a repetitive, time-consuming operation. It often requires a comprehensive analysis of the probable root causes, which informs the assessment of severity, consequences, and related issues and, ultimately, the decision about which resources are required to resolve the issue. The user must identify the failures anew for each regression. Manual failure analysis is therefore costly and inefficient, because bugs can be characterized by many different conditions, and scripted automation is not sophisticated enough to cope with this variety.
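To make the limitation of description-based grouping concrete, here is a minimal Python sketch of the kind of script a team might write for traditional triage. It is an illustration under stated assumptions, not any tool's actual implementation: the function names, the masking rules, and the failure messages are all hypothetical.

```python
import re
from collections import defaultdict


def signature(message: str) -> str:
    """Reduce a failure message to a coarse signature by masking volatile details."""
    sig = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", message)  # mask hex values such as addresses
    sig = re.sub(r"\d+", "<NUM>", sig)                  # mask timestamps, counts, indices
    return sig.strip().lower()


def bucket_by_message(failures):
    """Group (run_id, message) pairs by their message signature."""
    buckets = defaultdict(list)
    for run_id, message in failures:
        buckets[signature(message)].append(run_id)
    return dict(buckets)


# Hypothetical failure messages, invented for illustration only.
failures = [
    ("run_1", "UVM_ERROR @ 1200ns: scoreboard mismatch at addr 0x1F00"),
    ("run_2", "UVM_ERROR @ 3400ns: scoreboard mismatch at addr 0x2A10"),
    ("run_3", "UVM_FATAL @ 5000ns: watchdog timeout waiting for response"),
]
buckets = bucket_by_message(failures)
for sig, runs in buckets.items():
    print(sig, "->", runs)
```

The script places run_1 and run_2 in one bucket and run_3 in another. If run_3's timeout were in fact a downstream symptom of the same design error, a message-only grouping like this would have no way to detect it.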
Cadence Verisium AutoTriage builds ML models that help automate the repetitive task of regression failure triage by predicting and classifying test failures with common root causes. It offers automated ML bucketing of regression failures and brings improvements in areas such as:
- Efficiency
- Automation
- Knowledge sharing and reuse
Once runs are associated with a failure cluster, Cadence Verisium Manager examines the properties of the associated runs and tries to predict matching failure clusters for future failed runs.
How Cadence Verisium AutoTriage Helps
Cadence Verisium's AutoTriage app uses ML techniques to "learn" from the user's manual assignments and "predict" future assignments automatically. When the user runs a regression, the ML model creates and proposes failure clusters. Once the user associates a failed run with a failure cluster, the ML solution extracts the properties of that run as characteristics of the cluster. New failed runs are then evaluated as candidates for that failure cluster. This streamlines the process of identifying and addressing failures, making it more efficient and accurate.
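The learn-then-predict loop described above can be sketched in a few lines of Python. This is a toy illustration only, and it does not represent how Verisium AutoTriage works internally: the `ClusterModel` class, the token-overlap (Jaccard) similarity, the 0.5 threshold, and the run properties are all assumptions made for the example.

```python
from collections import Counter, defaultdict


class ClusterModel:
    """Toy learner: remembers token profiles of runs assigned to each failure cluster."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold            # minimum similarity needed to propose a match
        self.profiles = defaultdict(Counter)  # cluster name -> token counts

    @staticmethod
    def tokens(run: dict) -> set:
        # Treat every word of every property value as a feature of the run.
        return {word.lower() for value in run.values() for word in str(value).split()}

    def learn(self, run: dict, cluster: str) -> None:
        """Record a user's manual assignment of a failed run to a cluster."""
        self.profiles[cluster].update(self.tokens(run))

    def predict(self, run: dict):
        """Return the best-matching cluster, or None if nothing is similar enough."""
        run_tokens = self.tokens(run)
        best, best_score = None, 0.0
        for cluster, profile in self.profiles.items():
            union = run_tokens | set(profile)
            # Jaccard similarity between the new run and the cluster's known tokens.
            score = len(run_tokens & set(profile)) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = cluster, score
        return best if best_score >= self.threshold else None


# Hypothetical runs and cluster names, invented for illustration only.
model = ClusterModel()
model.learn({"test": "dma_burst", "module": "dma", "error": "scoreboard mismatch"},
            "DMA-data-corruption")
model.learn({"test": "uart_loop", "module": "uart", "error": "watchdog timeout"},
            "UART-hang")

guess = model.predict({"test": "dma_single", "module": "dma", "error": "scoreboard mismatch"})
unknown = model.predict({"test": "spi_x", "module": "spi", "error": "parity error"})
print(guess, unknown)
```

In this sketch the new DMA run shares three of five tokens with the learned DMA cluster, so it is proposed as a candidate for that cluster, while the SPI run matches nothing and returns `None`, which is the point at which a new cluster would need to be proposed.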
Use Case
Cadence Verisium AutoTriage with ML classification automates repetitive triage tasks, even without historical data. Early users have reported up to 70% savings in the time required to debug. To illustrate its effectiveness, consider a use case from an existing customer who evaluated Cadence Verisium AutoTriage on three modules in regression and saw 163 failures across 23 error types. The customer compared manual triage with auto triage; the methodology and the results are presented below.
The methodology used to compare manual and auto triage was as follows:
Manual Triage
- Create a script, check the failure log, and categorize failure results manually.
- Measure the accumulated man-hours needed to do a manual triage.
Auto Triage
- During the initial run, the tool can automatically propose new clusters based on the current failed runs, and the user then associates the failed runs with those clusters. Alternatively, the user can manually create failure clusters and assign the failed runs to them.
- From the second run onwards, AutoTriage can automatically categorize new failed results.
- Measure the accumulated man-hours needed when using AutoTriage.
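Once both man-hour totals have been measured, the two approaches can be compared with a simple relative-saving calculation. The helper below is a hypothetical sketch, and the hour values in the usage line are placeholders, not the customer's measurements.

```python
def time_saving_percent(manual_hours: float, auto_hours: float) -> float:
    """Percentage of triage time saved by auto triage relative to manual triage."""
    if manual_hours <= 0:
        raise ValueError("manual_hours must be positive")
    return 100 * (manual_hours - auto_hours) / manual_hours


# Placeholder values for illustration only; substitute the measured man-hours.
saving = time_saving_percent(manual_hours=8.0, auto_hours=2.0)
print(f"{saving:.1f}% saved")
```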