irmin-pack: GC suffix copy in child process appears slow
In the child process, the copy to the new suffix should be iterated a number of times (or until there are few bytes left to copy).
There is a branch that does this here: https://github.com/zshipko/irmin/tree/transfer-loop
When running the "tree.exe" benchmarks: dune exec ./bench/irmin-pack/tree.exe -- ./data4_100066commits.repr --artefacts=latest the following was observed:
+044us [INFO] index [root] sync
+044us [INFO] application XXX: transfer_loop new length 152483
+044us [INFO] index [root] sync
+044us [INFO] application XXX: transfer_loop new length 36526
+044us [INFO] index [root] sync
+044us [INFO] application XXX: transfer_loop new length 7620
+044us [INFO] index [root] sync
+044us [INFO] application XXX: transfer_loop new length 15160
+044us [INFO] index [root] sync
+044us [INFO] application XXX: transfer_loop new length 9326
+044us [INFO] index [root] sync
+044us [INFO] application XXX: transfer_loop new length 11704
+044us [INFO] index [root] sync
+044us [INFO] application XXX: transfer_loop new length 20629
+044us [INFO] application XXX: transfer_loop ran to end
What this means is that the parent is writing to the suffix faster than the child can copy. I don't think this was observed in the old code: there, my memory is that the raw copy in the child was much faster than the writes to the suffix in the parent (after all, one is a raw copy of bytes, whereas the other involves marshalling objects etc etc). It may point to poor performance in the "copy to suffix" code (or in the IO layer generally).
Related to #1959
Attached is a detailed log from running Zach's transfer-loop branch pr1975.log
Zach reported that changing the gap_tolerance seemed to limit the number of transfer_loop iterations to a small number (2 or 3). It is not clear why gap_tolerance would affect the speed of the suffix copy though. Tom is currently investigating this.
Here is a log with gap_tolerance set to 1000, it looks like in this case transfer_loop runs 1 iteration each time: out.log
Within the transfer loop, Zach's code executes Fm.reload. This is likely to be slow. However, in order to pick up a possible file length change, with the current code, this is needed. (In the original code, working with raw files, we would just fstat to get the length each time, which is very quick.)
This is the reason why the copy loop is "slow" compared to the main process.
This may be the reason why (in main branch) there is no loop when copying the suffix. However, a loop is surely needed (if the suffix is large, the copy takes a long time, the suffix grows a lot in the meantime, and if we don't copy again in the GC process, we have to do it in the main process, which will take some time).
So the question is: what to do now?
One thing is to confirm with Victor (or our own testing) that the main process blocks for a long time, when the suffix is large (as in the Tezos usecase) and when the copy loop is not iterated. It should be clearly observable that when the GC process finishes and the main process takes over, there is time spent in the additional suffix copy.
[Aside: for these performance bugs, we need some way of recording the time spent on particular code paths, so that we can later analyse; at the moment the use of a few printf's is ad-hoc and doesn't currently give good coverage of the paths we are interested in.]
Then one "fix" would be to accept that the Fm.reload takes time, so that the copy loop in the child may not be faster than the main process (when reading/updating the suffix). So we want to avoid the situation where the copy loop calls Fm.reload to find a small number of bytes have been written, copies these bytes, and then repeats; we don't mind if a small number of bytes - say 4096*5 - needs to be copied in the main process; we just want to avoid the "bad" interaction where we repeatedly copy a small number of bytes. I think Zach's current code does set a limit on the min number of bytes necessary to cause a copy. So something like Zach's code looks like the correct fix.
This has been addressed with mmap and transfer loop.