Peak memory usage may be suboptimal

Open marcelm opened this issue 1 year ago • 2 comments

I’ve done some measurements of how much memory StrobeAlign uses depending on the reference size (by truncating a single reference to varying lengths). This is the resulting pattern: [plot: mem] There are steps at around 50, 100, 200 and 400 Mbp. I had to stop at a reference size of 570 Mbp because I did this on my machine at home, which doesn’t have enough RAM for longer references.

The pattern could be explained by a std::vector that is being appended to (a std::vector typically doubles its capacity when it needs to grow, although the exact growth factor depends on the standard library). There are a lot of calls to std::vector::reserve() already; perhaps one of them needs to be adjusted, or we need one more.
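To make the suspected effect concrete, here is a minimal sketch that is not taken from the strobealign code base. The helper `count_reallocations` is a hypothetical name; it counts how often a vector's capacity changes while elements are appended, with and without a prior `reserve()`. The growth factor itself is implementation-defined, so the exact number of reallocations will differ between standard libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only (not part of strobealign).
// Appends n elements and counts how often the capacity changes on the way.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) {
        // One up-front allocation that is large enough for all n elements.
        v.reserve(n);
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            // The vector grew; capacity usually jumps by a constant factor.
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    assert(v.size() == n);
    return reallocations;
}
```

With `reserve_first` set, no capacity change happens during the appends. Without it, the capacity jumps in steps, and each jump can leave a large share of the allocated memory unused, which would produce a staircase pattern like the one measured.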

marcelm • Sep 10 '22 21:09

Ok, it’s the mers_index (hash map of type robin_hood). The mers_index.load_factor() is 0.77 at 90 Mbp and then goes down to 0.42 for 100 Mbp. I’m not sure, but does the robin_hood implementation only allow hash table sizes that are powers of two? Then there’s little we can do.
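The measured load factors fit a table that doubles in size. If the number of entries grows linearly with reference length, going from 90 to 100 Mbp without resizing would raise the load factor from 0.77 to about 0.86; doubling the table instead gives about 0.43, close to the observed 0.42. The sketch below models that behaviour under two assumptions that are not confirmed in this thread: bucket counts are powers of two, and the table grows once a maximum load factor (0.8 in the example) is exceeded. The function names are hypothetical and are not the robin_hood API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model, not the robin_hood implementation.
// Returns the smallest power-of-two bucket count that keeps
// num_elements / buckets at or below max_load_factor.
std::size_t pow2_buckets(std::size_t num_elements, double max_load_factor) {
    assert(max_load_factor > 0.0 && max_load_factor <= 1.0);
    std::size_t buckets = 1;
    while (static_cast<double>(num_elements) >
           max_load_factor * static_cast<double>(buckets)) {
        buckets *= 2;  // the table can only double, never grow by less
    }
    return buckets;
}

// Load factor the table ends up with for a given number of elements.
double load_factor_after_growth(std::size_t num_elements, double max_load_factor) {
    return static_cast<double>(num_elements) /
           static_cast<double>(pow2_buckets(num_elements, max_load_factor));
}
```

In this model, 100 elements with a maximum load factor of 0.8 fit into 128 buckets (load factor about 0.78), while 103 elements already need 256 buckets (load factor about 0.40). Just past each threshold, roughly half of the table is empty.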

marcelm • Sep 10 '22 22:09

Agreed with your last comment.

The initial allocation of the flat_vector is based on a prediction of the number of seeds that will be created from the reference. The prediction is roughly given by seed_thinning_factor * ref_length, so the allocation should already be fairly accurate.

I don't know the internals of the robin hood hash table. It is fairly memory-hungry, and I read somewhere that 2/3 of its memory goes to storing 'internal nodes'. The hash table is indeed the most memory-consuming data structure in strobealign. This hash table benchmark from 2022 (found by @johan-gson) may suggest viable alternatives, if there are any.

ksahlin • Sep 12 '22 09:09