irmin-pack: max memory usage 3x, main vs 3.3.1
This issue is based on #1959, but focuses solely on the increased memory usage. From the metrics listed in that issue:
| | 3.2-a | 3.2-b | 3.3.1-a | 3.3.1-b | good-gc-a | good-gc-b
| -- | -- | -- | -- | -- | -- | --
| -- main metrics -- | | | | | |
| CPU time elapsed | 164m49s | 165m36s 100% | 162m50s 99% | 163m50s 99% | 201m20s 122% | 200m53s 122%
| Wall time elapsed | 227m49s | 228m19s 100% | 220m18s 97% | 221m17s 97% | 205m50s 90% | 204m02s 90%
| | | | | | | |
| max memory usage (bytes) | 0.402 G | 0.395 G 98% | 0.441 G 110% | 0.443 G 110% | 1.413 G 352% | 1.457 G 363%
| mean CPU usage | 72% | 73% | 74% | 74% | 98% | 98%
For good-gc, the "max memory usage (bytes)" is roughly 3x that of 3.3.1 (1.413 G vs 0.441 G). We should identify the cause.
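As a quick sanity check of the "roughly 3x" figure, here are the ratios computed from the table above (good-gc vs 3.3.1, per machine):

```ocaml
(* Ratios of max memory usage, good-gc vs 3.3.1, from the table (GB). *)
let ratio_a = 1.413 /. 0.441  (* good-gc-a vs 3.3.1-a *)
let ratio_b = 1.457 /. 0.443  (* good-gc-b vs 3.3.1-b *)
let () = Printf.printf "a: %.2fx, b: %.2fx\n" ratio_a ratio_b
(* prints: a: 3.20x, b: 3.29x *)
```

(The 352%/363% figures in the table are relative to 3.2-a, not 3.3.1.)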
Some initial candidates for the increase in max memory usage:
- Size of the sparse map file, which is held in memory (these benchmarks likely use gap_tolerance=0)
- (? not sure about this) When calculating reachability, a hashtable is used to avoid revisiting offsets; perhaps this is getting very large?
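To illustrate the second candidate, here is a hypothetical sketch (not irmin-pack's actual code; `children` is a made-up stand-in for reading an entry's successors) of the suspected pattern: a reachability walk that memoizes every visited offset in a hashtable. If the store has tens of millions of reachable objects, the table alone can account for hundreds of MB.

```ocaml
(* Hypothetical sketch: reachability traversal over pack-file offsets,
   memoizing visited offsets to avoid revisiting them. The hashtable
   grows with the number of reachable objects. *)
let count_reachable ~(children : int64 -> int64 list) (roots : int64 list) =
  let visited : (int64, unit) Hashtbl.t = Hashtbl.create 1024 in
  let rec visit off =
    if not (Hashtbl.mem visited off) then begin
      Hashtbl.add visited off ();
      List.iter visit (children off)
    end
  in
  List.iter visit roots;
  (* number of distinct offsets visited = size of the memo table *)
  Hashtbl.length visited
```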
A comparison of main against 3.3.1 using the tree.exe benchmark shows only a minor difference in memory usage:
| | main | 3.3.1
| -- | -- | --
| -- main metrics -- | |
| max memory usage | 1.335 G | 1.518 G 114%
| mean CPU usage | 100% | 100%
The size of the mapping file, at least with this particular test run on main, is only 1.7MB; so this is not a source of significant memory usage.
The following gives the machine setups for the benchmarks from #1959. The good-gc runs used OCaml 4.12.1 whereas the others used 4.14.0. Also, gc.space_overhead for good-gc was 80, whereas it was 120 for the other machines. So we can't compare these stats directly.
| | 3.2-a | 3.2-b | 3.3.1-a | 3.3.1-b | good-gc-a | good-gc-b
| -- | -- | -- | -- | -- | -- | --
| hostname | tezos-action-trace-recording-12-05 | tezos-action-trace-recording-12-05 | tezos-gc-action-trace-recording-12-05 | tezos-gc-action-trace-recording-12-05 | bench-gc | bench-gc
| os type | Unix | <same> | <same> | <same> | <same> | <same>
| big endian | false | <same> | <same> | <same> | <same> | <same>
| runtime params (no gc) | b=1,H=0,p=0,t=0,v=0,w=1,W=0 | <same> | <same> | <same> | <same> | <same>
| backend type | native | <same> | <same> | <same> | <same> | <same>
| ocaml version | 4.14.0 | 4.14.0 | 4.14.0 | 4.14.0 | 4.12.1 | 4.12.1
| word size | 64 bits | <same> | <same> | <same> | <same> | <same>
| Start Time | 2022/06/16 13:27:55 (GMT) | 2022/06/16 19:47:03 (GMT) | 2022/06/16 14:41:02 (GMT) | 2022/06/16 20:32:41 (GMT) | 2022/07/04 17:37:23 (GMT) | 2022/07/05 07:42:00 (GMT)
| Store Type | pack | <same> | <same> | <same> | <same> | <same>
| gc.minor_heap_size | 262144 | <same> | <same> | <same> | <same> | <same>
| gc.major_heap_increment | 15 | <same> | <same> | <same> | <same> | <same>
| gc.space_overhead | 120 | 120 | 120 | 120 | 80 | 80
The OCaml manual has this for gc.space_overhead:
> The major GC speed is computed from this parameter. This is the memory that will be "wasted" because the GC does not immediately collect unreachable blocks. It is expressed as a percentage of the memory used for live data. The GC will work more (use more CPU time and collect blocks more eagerly) if space_overhead is smaller. Default: 120.
Judging by that number alone, good-gc should have a more aggressive GC, and so should ideally use less memory.
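For reference, the space_overhead difference between the runs can be reproduced either with `OCAMLRUNPARAM=o=80` at startup, or programmatically via the stdlib Gc module:

```ocaml
(* Lower space_overhead = more eager major GC: more CPU spent collecting,
   less heap "wasted" on not-yet-collected unreachable blocks. *)
let () =
  Gc.set { (Gc.get ()) with Gc.space_overhead = 80 };
  assert ((Gc.get ()).Gc.space_overhead = 80)
```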
However, OCaml 4.13.0 changed the default major-heap allocation policy to best-fit, and various other GC improvements were made. So, again, we can't directly compare good-gc to the others when considering max memory usage.
Another comparison: tree benchmark, with gc, irmin main, but ocaml 4.12 vs 4.14. Max memory usage was 1.335G in 4.12, 1.212G in 4.14 (so not much difference).
Aside: Wall time was 13m28s vs 11m01s, so 4.14 seems about 20% quicker on this benchmark.
At this point, it seems that the tree benchmarks do not exhibit 3x memory usage (at least in the runs detailed above). So investigation should continue with the lib_context benchmarks instead.
Something else we should be careful about: I think the max memory usage reported by all our benchmarks is for the main process only. I started a bench which records the maxrss (MB) of the GC child process at the beginning and at the end of Gc.run:
```
begin 131 end 1040 (gc 1)
begin 235 end 1791
begin 344 end 1945
begin 399 end 2166
begin 545 end 2130
begin 619 end 2434
begin 627 end 2442 (gc 7)
```
The begin value corresponds approximately to the maxrss of the parent process.
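For anyone reproducing these numbers: on Linux, a begin/end sample like the above can be taken by reading the process's peak RSS (VmHWM, in kB) from procfs. This is a Linux-only sketch; the surrounding GC-worker code is not shown.

```ocaml
(* Read peak resident set size (VmHWM, kB) from /proc/self/status.
   Returns None if the field is absent (e.g. on non-Linux systems). *)
let peak_rss_kb () =
  let ic = open_in "/proc/self/status" in
  let rec scan () =
    match input_line ic with
    | exception End_of_file -> close_in ic; None
    | line ->
        if String.length line > 6 && String.sub line 0 6 = "VmHWM:" then begin
          close_in ic;
          Some
            (Scanf.sscanf
               (String.sub line 6 (String.length line - 6))
               " %d" (fun kb -> kb))
        end
        else scan ()
  in
  scan ()

let () =
  (* Sample once at the start of the worker's run; sample again at the end. *)
  match peak_rss_kb () with
  | Some kb -> Printf.printf "begin %d kB\n" kb
  | None -> print_endline "VmHWM not available"
```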
Are you saying that the main process uses 131 MB maxrss at the point the child is forked, and that the child process then gets to 2442 MB? The child process is supposed to use almost no memory at all (this was part of the design requirements, and was satisfied, I believe, in the previous version of the code). Do you think it is possible the new "reachable" calculation has changed the memory usage? (I think there is a hashtable there that perhaps grows very large.)
Memory in the main process is now in a good spot. See https://github.com/mirage/irmin/issues/1959#issuecomment-1219810473
Tracking GC stats for benchmarking here https://github.com/mirage/irmin/issues/2046