ovis
shm documentation request, part 2
I ran into an issue with my metric set size configuration: I had to raise the limit from the default of 1024 to fit a larger metric set. I'm not sure what I did to exceed this limit, or how I would know what to set shm_metric_max to without trial and error.
I wondered if you could also explain the other available config options in more detail, in particular shm_array_max and shm_boxmax.
@iramin any further clues for @qwofford ? If you covered it in private email, please post something here or make a pull request with improvements to the in-tree documentation.
Regarding the configuration issue, we need more information about your setup and configuration to figure out what caused the problem.
Here, I provide more information about these config options.
shm_boxmax imposes a limit on the number of entries in the shared memory index file, i.e., the number of different applications that we want to monitor simultaneously. For example, in the following figure, we have n entries in the shared memory index.
Figure 1 of the paper:

shm_array_max imposes a limit on the number of elements that can be in a single array metric. For example, if we are collecting the number of calls to MPI_Send by different ranks and storing all of them in a single array, the number of ranks must not exceed this limit. If the limit is 64, we cannot put more than 64 numbers in a single array.
shm_metric_max imposes a limit on the total number of metrics in a metric set. For example, if we have an array-based metric set with 10 metrics and each metric has 64 elements in its array, the total number of metrics is 10 * 64 = 640.
For example, in the following figure, we have n metrics in total.
Figure 2 of the paper:
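
To make the counting rule above concrete, here is a minimal Python sketch of the arithmetic. It is not part of the sampler; the function names are hypothetical, and it assumes every array metric in the set has the same number of elements. The value 1024 is the default mentioned earlier in this thread.

```python
def total_metrics(num_array_metrics: int, elements_per_array: int) -> int:
    """Total metric count of an array-based metric set, counted as
    (number of array metrics) * (elements in each array)."""
    return num_array_metrics * elements_per_array


def fits(num_array_metrics: int, elements_per_array: int,
         shm_array_max: int, shm_metric_max: int) -> bool:
    """Check one metric set against both limits described above.

    Hypothetical helper: it only mirrors the rules stated in this thread,
    it does not call into the sampler.
    """
    within_array_limit = elements_per_array <= shm_array_max
    within_metric_limit = (
        total_metrics(num_array_metrics, elements_per_array) <= shm_metric_max
    )
    return within_array_limit and within_metric_limit


if __name__ == "__main__":
    # 10 array metrics with 64 elements each, as in the example above.
    print(total_metrics(10, 64))
    # With shm_array_max=64 and shm_metric_max=1024 this set fits.
    print(fits(10, 64, shm_array_max=64, shm_metric_max=1024))
```

Under these assumptions, a set that suddenly exceeds the default is most likely one where either the number of array metrics or the number of ranks grew, since the total is their product.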

This helps me, thank you @iramin .
I suppose shm_metric_max exists in case we aren't paying attention to the value of shm_boxmax*shm_array_max and accidentally exceed our sampler memory usage tolerance?
Stated another way, when would we set shm_metric_max to a value that is not exactly shm_boxmax*shm_array_max?
Thank you @qwofford for using this sampler and reporting the issues.
In the Shared Memory Index, we can have multiple entries corresponding to each application that we are monitoring.
For example, let's say we are monitoring 4 applications (Nalu, miniMD, CoMD, lammps) with a single shm_sampler. We then have 4 entries in the shared memory index (shm_boxmax >= 4).
For each entry/application, we can have a different metric set.
For example:
Nalu --> 4 metrics (MPI_Send, MPI_Recv, MPI_Issend, MPI_Irecv) and 4 ranks. We have 4 array-based metrics and each array has 4 values. (shm_array_max >= 4)
Total metrics=16 <= shm_metric_max
miniMD --> 2 metrics (MPI_Send, MPI_Recv) and 16 ranks. We have 2 array-based metrics and each array has 16 values. (shm_array_max >= 16)
Total metrics=32 <= shm_metric_max
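
The two cases above can be turned into a rough sizing calculation. The following Python sketch is only an illustration of the rules described in this thread, not code from the sampler: `required_limits` and the `apps` dictionary are hypothetical names, and only Nalu and miniMD are included because those are the two applications whose metric sets are spelled out here. It assumes shm_metric_max applies to each metric set separately, as described above.

```python
def required_limits(apps: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Return the smallest limits that accommodate every application.

    `apps` maps an application name to (number of array metrics, number of ranks).
    """
    return {
        # One shared memory index entry per monitored application.
        "shm_boxmax": len(apps),
        # Each array holds one value per rank, so the largest rank count wins.
        "shm_array_max": max(ranks for _, ranks in apps.values()),
        # The limit is per metric set, so take the largest set,
        # not the sum over all applications.
        "shm_metric_max": max(metrics * ranks for metrics, ranks in apps.values()),
    }


if __name__ == "__main__":
    apps = {
        "Nalu": (4, 4),     # 4 array metrics, 4 ranks  -> 16 total
        "miniMD": (2, 16),  # 2 array metrics, 16 ranks -> 32 total
    }
    print(required_limits(apps))
```

For these two applications the sketch yields shm_boxmax >= 2, shm_array_max >= 16, and shm_metric_max >= 32, which shows why shm_metric_max does not need to equal shm_boxmax*shm_array_max: the three limits bound different things.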