Have batch merges

Open pascutto opened this issue 4 years ago • 0 comments

The most resource intensive operation in index is currently the merge operation, which combines the sorted in-memory log with the on-disk data into the next sorted data file. It basically loops over both file, scanning their head, and writing the smallest value into the new file.

Issue

However, due to the size of the index in the context of irmin-pack, the data file is many orders of magnitude bigger than the merged log. In this setting, most of the merge loop is actually a loop over data bindings only, which have to be read, decoded, compared, and rewritten.

In steady state, about 10 pages of data are unnecessary scanned this way, for each single log entry inserted, which is obviously a waste of resources.

Fixes

The merge function should skip the irrelevant pages of data and just copy them as-is when necessary.

Whether a page (or generally a range) is irrelevant to scan could be decided using either the fanout or the interpolation search* we already have, to find the next location where the log entry should be inserted, instead of scanning data linearly until then.

Note that using the interpolation search is more precise, but also induces more computations, and potentially reads and allocations, which right now cannot be shared with merge, so both would probably have to be experimented and compared.

An important note is that scanning the data file is necessary right now not only to compare against the log entries and insert in sorted order, but also to register the entries in the fanout, so they can later be retrieved. In order to avoid having to register these (previously registered in the former fanout) entries, this optimization depends on #281 being implemented.

Mar 17 '21 13:03 pascutto