Unfortunately, on some occasions this cup of coffee will be very much needed in order to debug why Specman managed the memory this way, and why all of a sudden I can't run my simulation because it consumes too much memory or fails for various other memory-related reasons.
The key to resolving these kinds of issues (or better yet, to avoiding these situations altogether) is to understand how Specman decides which mechanism to use each time, and how to influence those decisions. This paper will hopefully help you understand these points.
Let's start with a quote from the documentation (Incisive Enterprise Specman Elite Testbench Specman Performance Handbook, chapter 5 - Managing Specman Memory):
The Specman memory manager periodically frees up space in memory by automatically recycling unusable data. Any object (struct, list, string, or long integer) that is not used by the Specman scheduler or is not connected to sys is considered unusable and defined by Specman as garbage. The recycling process is called "garbage collection" or "GC".
OK - so garbage is any unused data that Specman has allocated memory for. To tell what is garbage and what is not, Specman applies a very simple rule: whatever is reachable from the sys struct (the root of anything a user would use) or from the scheduler is NOT garbage. All the rest IS garbage.
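The reachability rule above can be illustrated with a small Python sketch. This is only a conceptual model, not Specman code: the object names and the `refs` graph are hypothetical, and the real memory manager works on its own internal structures.

```python
def reachable(roots, refs):
    """Return every object reachable from the roots by following references."""
    seen = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(refs.get(obj, ()))
    return seen

# Hypothetical object graph: each name maps to the objects it points at.
refs = {
    "sys": ["env"],
    "env": ["packet_list"],
    "scheduler": ["pending_thread"],
    "old_packet": ["old_payload"],  # nothing points at old_packet any more
}
all_objects = {
    "sys", "env", "packet_list",
    "scheduler", "pending_thread",
    "old_packet", "old_payload",
}

# The two roots named in the text: sys and the scheduler.
live = reachable(["sys", "scheduler"], refs)
garbage = all_objects - live
print(sorted(garbage))  # ['old_payload', 'old_packet']
```

Note that old_payload is garbage even though something still points at it, because the only thing pointing at it is itself unreachable.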
Types of Garbage Collection
Now Specman has to get rid of that garbage. The common and most efficient way to do so is called Copy Garbage Collection: Specman simply copies everything that is reachable to a different place in memory, and then frees the entire "old" memory. That leaves us "garbage free."
The main problem with this method is that it may need a lot of extra memory from the OS just to complete the procedure. In the worst case (when there is no garbage at all), the entire memory image is copied, and at the peak of the procedure Specman consumes twice the memory it actually needs.
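To make the "double the memory" point concrete, here is a minimal Python model of a copying collection. It is a sketch under simplifying assumptions: the heap is a dictionary of hypothetical object names and sizes, and the set of live objects is given rather than computed.

```python
def copy_gc(heap, live):
    """Model a copying collection.

    heap: dict mapping object name -> size (arbitrary units, e.g. MB)
    live: set of names that are still reachable
    Returns the new heap and the peak footprint during the copy.
    """
    old_size = sum(heap.values())
    new_heap = {name: size for name, size in heap.items() if name in live}
    # Until the old space is released, both spaces exist at the same time.
    peak = old_size + sum(new_heap.values())
    return new_heap, peak

heap = {"a": 100, "b": 200, "c": 100}  # hypothetical sizes

# Typical case: "c" is garbage, so only 300 units are copied.
after, peak = copy_gc(heap, live={"a", "b"})
print(sum(after.values()), peak)  # 300 700

# Worst case: nothing is garbage, so the peak is double the heap size.
after_worst, peak_worst = copy_gc(heap, live={"a", "b", "c"})
print(sum(after_worst.values()), peak_worst)  # 400 800
```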
If there is enough free memory on the machine, this is not an issue. The problem starts when the machine can't allocate that much memory for the Specman process. In this case Copy GC is not an option.
This is where the other type of GC comes in - the Disc-Based GC. It tackles the memory-hungry nature of Copy GC by doing the exact same procedure as described above, but instead of copying the entire memory image to a different location in memory, it copies it to disc. Specman uses this mechanism automatically when it gets an "out of memory" indication from the OS during Copy GC.
Obviously, the Disc-Based GC is slower, since it uses an external disc, which carries a (sometimes major) performance hit.
So what if I don't have enough memory to waste on a Copy GC, nor enough time to waste on Disc-Based GC? This is when the third type of GC gets into the picture - the On-The-Fly GC. OTF GC is great from that perspective. It's a lot quicker than Disc-Based GC, and uses far less memory than Copy GC. It can also be invoked on demand by a method call (do_otf_gc()), and it can kick in during a Specman tick. The other two GC mechanisms cannot do that, since they require a tick (you can think of it as similar to a simulation delta cycle) in order to operate.
However, nobody's perfect. OTF GC does not use the copy principle, but the "mark and sweep" principle, which performs the cleaning job a lot less efficiently.
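For readers unfamiliar with the term, here is a minimal sketch of a generic mark-and-sweep collector in Python. It illustrates the general principle only and is not a description of how Specman implements OTF GC; the heap contents are hypothetical.

```python
def mark_and_sweep(heap, refs, roots):
    """Free unreachable objects in place and return the names that were freed."""
    # Mark phase: flag everything reachable from the roots.
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in heap and obj not in marked:
            marked.add(obj)
            stack.extend(refs.get(obj, ()))
    # Sweep phase: walk the whole heap and release whatever was not marked.
    freed = [name for name in heap if name not in marked]
    for name in freed:
        del heap[name]
    return sorted(freed)

heap = {"sys": 10, "env": 20, "tmp_list": 30}  # hypothetical objects and sizes
refs = {"sys": ["env"]}

print(mark_and_sweep(heap, refs, roots=["sys"]))  # ['tmp_list']
print(sorted(heap))                               # ['env', 'sys']
```

Unlike the copy approach, the surviving objects stay where they are and no second copy of the live data is made, which is why this style of collection needs far less extra memory.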
So as you can see, those three mechanisms complement each other, and each one is best suited to different scenarios. So... who is in charge of deciding when to use each?
GC Triggers and Settings
It depends... You can let Specman decide by using the Automatic Garbage Collection Configuration mechanism (by setting configure memory -automatic_gc_settings=STANDARD - this is actually the default), or keep the decision making to yourself. If you do decide to manage the memory yourself, there are a few configuration flags you should know and handle:
gc_threshold: The amount of memory that Specman can consume before copy GC kicks in.
max_size: Maximum amount of memory that Specman should use. This is the threshold for OTF GC.
absolute_max_size: Maximum amount of memory that Specman can use before fatally terminating execution.
disable_disc_based_gc: Don't use Disc-Based GC when Specman is "Out Of Memory" during GC (for example, when there is not enough disc space, or when the slower Disc-Based GC would take too long). Specman will use OTF GC instead.
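One way to read the three size thresholds listed above is as escalating limits. The Python sketch below is a simplified model of those descriptions, not Specman's actual decision logic: the function name, the 300 MB gc_threshold value, and the assumption that gc_threshold < max_size < absolute_max_size are all hypothetical. The other two values are taken from the error message shown later in this paper.

```python
MIB = 1024 * 1024

def gc_action(requested_bytes, gc_threshold, max_size, absolute_max_size):
    """Return which reaction a given memory footprint would trigger in this model."""
    if requested_bytes >= absolute_max_size:
        return "fatal exit"   # absolute_max_size: execution terminates
    if requested_bytes >= max_size:
        return "OTF GC"       # max_size: threshold for OTF GC
    if requested_bytes >= gc_threshold:
        return "copy GC"      # gc_threshold: copy GC kicks in
    return "no GC"

limits = dict(
    gc_threshold=300 * MIB,        # hypothetical value
    max_size=419430400,            # 400 MB
    absolute_max_size=471859200,   # 450 MB
)

for used_mib in (200, 350, 420, 460):
    print(used_mib, gc_action(used_mib * MIB, **limits))
```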
On the other hand, in case you want Specman to manage those configurations, all you should specify is the optimal_process_size that you would want for this process, and Specman will determine the values above automatically. You should be aware that manually changing any of the above values will cancel the Automatic GC Configuration mechanism.
If you just asked yourself "which one should I use?", you are probably better off trying the automatic mechanism first. This mechanism can handle most traditional mainstream environments. Nevertheless, if yours isn't one of them, or you find yourself suffering from memory issues, then switching to the manual mechanism will give you a lot more flexibility to tackle these issues.
Types of Failures
So, what are these memory issues mentioned above?
Let's differentiate between two types of issues:
1. Specman Memory Management errors / warnings
2. OS errors
The first type is easier and can usually be solved by changing the Memory Management parameters. It also deals only with Specman's memory consumption and not with the entire process that Specman is a part of. The first and more common error you might get from Specman's memory management is the following:
*** Error: Total memory requested from operating system exceeds get_config(memory,max_size) (419430400). Will exit at get_config(memory,absolute_max_size) (471859200).
This error tells us that we have hit pre-defined values (max_size, absolute_max_size) that are supposed to limit Specman's memory consumption. These values are described above, and are set either by the automatic mechanism (which means, in this case, that the automatic mechanism was not very successful...) or by the user. There is another option, in which the user does not use the automatic mechanism but does not specify those values either. In this case, default values are assigned to them (the numbers in the specific message above).
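The numbers in the message are byte counts, which are easier to reason about once converted. A short Python calculation, using only the two values printed in the message above, shows what they amount to:

```python
MIB = 1024 * 1024  # bytes per megabyte (binary)

max_size = 419430400           # from the error message
absolute_max_size = 471859200  # from the error message

max_size_mib = max_size // MIB
absolute_max_size_mib = absolute_max_size // MIB
headroom_mib = absolute_max_size_mib - max_size_mib

print(max_size_mib)           # 400
print(absolute_max_size_mib)  # 450
print(headroom_mib)           # 50
```

In other words, in this message Specman starts complaining at 400 MB and has only 50 MB of headroom left before it exits.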
Another issue you might have, also related to Specman Memory Management, is getting too many consecutive GCs. This slows down simulation tremendously and keeps printing GC messages to the screen. It happens when the gc_threshold or max_size parameters are not tuned correctly: the "steady" memory Specman needs is actually higher than those thresholds, so Specman does GC all the time but probably never reaches equilibrium.
These are the easy errors to handle, and you would probably wish to be left dealing only with them, since the solutions to both problems are pretty easy.
On the other hand, the second type of errors - OS errors - is a bit trickier to deal with. Let's first look at what we mean by a process:
[Diagram: process PID #123456, containing several applications that share one allocated segment of memory]
As you can see from the diagram, the process with PID #123456 (just an example) can contain several applications running together and sharing a specific allocated segment of memory on the machine.
If we are dealing with 32-bit applications, the size of this process cannot exceed 4GB (and if you run the 32-bit process on a 32-bit machine, the OS allows only about 3GB of user data for the process, with the rest allocated to the kernel, so the quota is even smaller).
With 64-bit applications, the size is not limited by a specific number but by other OS considerations. However, there might be several other processes on that machine that need memory as well. For this the OS uses what's called "virtual memory," which moves some of the process memory from RAM to disc. We will not go into the details of this mechanism now, but in general the entire process size (in RAM + on disc) is called VSIZE, and this size is limited by the OS.
You can now understand that when VSIZE goes over what the OS gives this process, it doesn't really matter (to the OS) which of these applications is responsible for the memory requested. When the OS decides to "cut off" the supply (i.e. a memory allocation request returns NULL), the application that happened to be doing the malloc at that moment has to deal with it, but the entire process goes down. Every application can deal with it differently. Specman will issue an "Out Of Memory" error in the best case, or crash with an OS11 signal in the worst case (usually when some memory corruption was caused by this excessive usage of memory).
When dealing with these kinds of situations, we first need to identify which of these applications is taking the most memory and, moreover, which one can go on a diet for the sake of the process.
For this analysis, there are several tools and utilities we can use for every application. For Specman we have the memory debug flags and the memory profiler. These will not be discussed in this paper, as debugging techniques are a broad topic that requires an entire paper of its own.
Summary
If we understand the basic behavior of Specman's Memory Management mechanism, we will be able to identify most of the memory problems that pop up during our work. It is very important to identify whether we are facing a Specman error, which complains about a setting that is responsible for it, or an OS error, which complains about the process that Specman is a member of.
The most important thing is to identify when Specman is not behaving as we think it should. If we identify this point, and the reason for this "misbehavior," we will be halfway towards resolving the problem (or identifying a Specman bug...).
The flow chart below is very useful when you want to analyze Specman's actions and reactions, the relations between all types of GC, and what triggers each one of them. It also shows the paths to errors, which can help you identify what you need to focus on in order to make them go away.
Avi Farjoun, Cadence Design Systems