irmin-pack: max memory usage 3x, main vs 3.3.1
This issue is based on #1959, but focuses solely on the increased memory usage. From the metrics listed in that issue:
| | 3.2-a | 3.2-b | 3.3.1-a | 3.3.1-b | good-gc-a | good-gc-b
| -- | -- | -- | -- | -- | -- | --
| -- main metrics -- | | | | | |
| CPU time elapsed | 164m49s | 165m36s 100% | 162m50s 99% | 163m50s 99% | 201m20s 122% | 200m53s 122%
| Wall time elapsed | 227m49s | 228m19s 100% | 220m18s 97% | 221m17s 97% | 205m50s 90% | 204m02s 90%
| | | | | | | |
| max memory usage (bytes) | 0.402 G | 0.395 G 98% | 0.441 G 110% | 0.443 G 110% | 1.413 G 352% | 1.457 G 363%
| mean CPU usage | 72% | 73% | 74% | 74% | 98% | 98%
For good-gc, the "max memory usage (bytes)" is roughly 3x that of 3.3.1 (1.413 G vs 0.441 G). We should identify the cause.
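As a quick sanity check of the "roughly 3x" figure, here are the ratios computed from the table above (good-gc vs 3.3.1, per machine):

```ocaml
(* Ratios of max memory usage, good-gc vs 3.3.1, from the table (GB). *)
let ratio_a = 1.413 /. 0.441  (* good-gc-a vs 3.3.1-a *)
let ratio_b = 1.457 /. 0.443  (* good-gc-b vs 3.3.1-b *)
let () = Printf.printf "a: %.2fx, b: %.2fx\n" ratio_a ratio_b
(* prints: a: 3.20x, b: 3.29x *)
```

(The 352%/363% figures in the table are relative to 3.2-a, not 3.3.1.)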
Some initial candidates for the increase in max memory usage:
- Size of the sparse map file, which is held in memory (these benchmarks likely use gap_tolerance=0)
- (? not sure about this) When calculating reachability, a hashtable is used to avoid revisiting offsets; perhaps this is getting very large?
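To illustrate the second candidate, here is a hypothetical sketch (not irmin-pack's actual code; `children` is a made-up stand-in for reading an entry's successors) of the suspected pattern: a reachability walk that memoizes every visited offset in a hashtable. If the store has tens of millions of reachable objects, the table alone can account for hundreds of MB.

```ocaml
(* Hypothetical sketch: reachability traversal over pack-file offsets,
   memoizing visited offsets to avoid revisiting them. The hashtable
   grows with the number of reachable objects. *)
let count_reachable ~(children : int64 -> int64 list) (roots : int64 list) =
  let visited : (int64, unit) Hashtbl.t = Hashtbl.create 1024 in
  let rec visit off =
    if not (Hashtbl.mem visited off) then begin
      Hashtbl.add visited off ();
      List.iter visit (children off)
    end
  in
  List.iter visit roots;
  (* number of distinct offsets visited = size of the memo table *)
  Hashtbl.length visited
```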
A comparison of main against 3.3.1 using the tree.exe benchmark shows only a minor difference in memory usage:
| | main | 3.3.1
| -- | -- | --
| -- main metrics -- | |
| max memory usage | 1.335 G | 1.518 G 114%
| mean CPU usage | 100% | 100%
The size of the mapping file, at least with this particular test run on main, is only 1.7MB; so this is not a source of significant memory usage.
The following gives the machine setups for the benchmarks from #1959. The good-gc runs used OCaml 4.12.1 whereas the others used 4.14.0. Also, gc.space_overhead for good-gc was 80, whereas it was 120 for the other machines. So we can't compare these stats directly.
| | 3.2-a | 3.2-b | 3.3.1-a | 3.3.1-b | good-gc-a | good-gc-b
| -- | -- | -- | -- | -- | -- | --
| hostname | tezos-action-trace-recording-12-05 | tezos-action-trace-recording-12-05 | tezos-gc-action-trace-recording-12-05 | tezos-gc-action-trace-recording-12-05 | bench-gc | bench-gc
| os type | Unix | <same> | <same> | <same> | <same> | <same>
| big endian | false | <same> | <same> | <same> | <same> | <same>
| runtime params (no gc) | b=1,H=0,p=0,t=0,v=0,w=1,W=0 | <same> | <same> | <same> | <same> | <same>
| backend type | native | <same> | <same> | <same> | <same> | <same>
| ocaml version | 4.14.0 | 4.14.0 | 4.14.0 | 4.14.0 | 4.12.1 | 4.12.1
| word size | 64 bits | <same> | <same> | <same> | <same> | <same>
| Start Time | 2022/06/16 13:27:55 (GMT) | 2022/06/16 19:47:03 (GMT) | 2022/06/16 14:41:02 (GMT) | 2022/06/16 20:32:41 (GMT) | 2022/07/04 17:37:23 (GMT) | 2022/07/05 07:42:00 (GMT)
| Store Type | pack | <same> | <same> | <same> | <same> | <same>
| gc.minor_heap_size | 262144 | <same> | <same> | <same> | <same> | <same>
| gc.major_heap_increment | 15 | <same> | <same> | <same> | <same> | <same>
| gc.space_overhead | 120 | 120 | 120 | 120 | 80 | 80
The OCaml manual has this for gc.space_overhead:
> The major GC speed is computed from this parameter. This is the memory that will be "wasted" because the GC does not immediately collect unreachable blocks. It is expressed as a percentage of the memory used for live data. The GC will work more (use more CPU time and collect blocks more eagerly) if space_overhead is smaller. Default: 120.
Judging by that number alone, good-gc should have a more aggressive GC, and so should ideally use less memory.
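For reference, the space_overhead difference between the runs can be reproduced either with `OCAMLRUNPARAM=o=80` at startup, or programmatically via the stdlib Gc module:

```ocaml
(* Lower space_overhead = more eager major GC: more CPU spent collecting,
   less heap "wasted" on not-yet-collected unreachable blocks. *)
let () =
  Gc.set { (Gc.get ()) with Gc.space_overhead = 80 };
  assert ((Gc.get ()).Gc.space_overhead = 80)
```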
However, OCaml 4.13.0 changed the default major-heap allocation policy to best-fit, and various other GC improvements were made. So, again, we can't directly compare good-gc to the others when considering max memory usage.
Another comparison: tree benchmark, with gc, irmin main, but ocaml 4.12 vs 4.14. Max memory usage was 1.335G in 4.12, 1.212G in 4.14 (so not much difference).
Aside: Wall time was 13m28s vs 11m01s, so 4.14 seems about 20% quicker on this benchmark.
At this point, it seems that the tree benchmarks do not exhibit 3x memory usage (at least in the runs detailed above). So investigation should continue with the lib_context benchmarks instead.
Something else we should be careful about: I think the max memory usage reported by all our benchmarks is for the main process only. I started a bench which records the maxrss (MB) of the GC child process at the beginning and at the end of Gc.run:
```
begin 131 end 1040 (gc 1)
begin 235 end 1791
begin 344 end 1945
begin 399 end 2166
begin 545 end 2130
begin 619 end 2434
begin 627 end 2442 (gc 7)
```
The begin value corresponds approximately to the maxrss of the parent process.
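For anyone reproducing these numbers: on Linux, a begin/end sample like the above can be taken by reading the process's peak RSS (VmHWM, in kB) from procfs. This is a Linux-only sketch; the surrounding GC-worker code is not shown.

```ocaml
(* Read peak resident set size (VmHWM, kB) from /proc/self/status.
   Returns None if the field is absent (e.g. on non-Linux systems). *)
let peak_rss_kb () =
  let ic = open_in "/proc/self/status" in
  let rec scan () =
    match input_line ic with
    | exception End_of_file -> close_in ic; None
    | line ->
        if String.length line > 6 && String.sub line 0 6 = "VmHWM:" then begin
          close_in ic;
          Some
            (Scanf.sscanf
               (String.sub line 6 (String.length line - 6))
               " %d" (fun kb -> kb))
        end
        else scan ()
  in
  scan ()

let () =
  (* Sample once at the start of the worker's run; sample again at the end. *)
  match peak_rss_kb () with
  | Some kb -> Printf.printf "begin %d kB\n" kb
  | None -> print_endline "VmHWM not available"
```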
Are you saying that the main process uses 131 MB maxrss at the point the child is forked, and that the child process then gets to 2442 MB? The child process is supposed to use almost no memory at all (this was part of the design requirements, and was satisfied, I believe, in the previous version of the code). Do you think it is possible the new "reachable" calculation has changed the memory usage? (I think there is a hashtable there that perhaps grows very large.)
Memory in the main process is now in a good spot. See https://github.com/mirage/irmin/issues/1959#issuecomment-1219810473
Tracking GC stats for benchmarking here https://github.com/mirage/irmin/issues/2046