Random generation is always a complex task, and differences in results are usually very hard to debug. Besides, generation misbehavior always rings many bells in R&D :-)
A customer reported a random stability issue, explaining that the generator (IntelliGen) generated different values with the same seed. One simulation was started from vManager, the other in a Unix shell, and they ran in different run modes (compiled vs. interpreted).
Looking into the (quite complex) environment, it turned out that the beginning of the simulation was identical, but as time advanced the results started to differ. I assume some of you have experienced similar behavior in the past.
A first look revealed that complex list manipulations were performed in many levels of nested method calls. Each list -- a list of units -- was manipulated by several list methods (sort, add, unique, etc.). The results were printed out to the screen where, after a while, the lists started to differ.
So an idea came to mind: The problem is probably not a generation issue, as the static generation was identical in all cases; rather it is a runtime issue, most likely caused by list manipulation. But how could the way the simulation was launched or the run-mode be responsible for the differences?
With no alternative, we proceeded to debug through the source code step by step, and examined the list after every manipulation. The complexity and deep nesting of the code (there were even recursive methods that touched those lists), resulted in about two days of painstaking analysis without finding a difference. Then, we hit pay dirt -- we came across the construct where the lists began to differ. Below is the sample code:
So where is the problem? The code identified that the list of units that was sorted, but did not identify the specific field used for sorting -- the argument (it) referred to the unit itself.
Looking up the sort() method in the Incisive/Specman documentation, we found the following:
The Note in the description seemed to suggest a possible clue.
A scan of the log files showed that the vManager run started a garbage collection at some point before the sorting action, and that the plain Specman simulation logs did not show this garbage collection. This difference in behavior was the result of different memory settings between different simulation runs.
The bottom line: Garbage collections can change the physical memory address of the units in the list, which can affect the sorting of these addresses before and after such a memory operation.
Summary:
- The root cause for the problem was that the sorting statement contained a bad argument (the result of a copy and paste error) -- the construct was taken from code where it worked perfectly fine with a list of strings.
- The problem was not generation-related, even if it looked that way in the first place.
- More importantly, we uncovered usage of a problematic construct: The user based the sort on the physical address. This should be avoided even if garbage collection is not performed (and all the more so when garbage collection is performed).
- Uncovering such an issue is a perfect task for the Specman Linter. Cadence intends to enhance the linter's capabilities in that direction.
Hans Zander