strobealign
Peak memory usage may be suboptimal
I’ve done some measurements of how much memory StrobeAlign uses depending on the reference size (by truncating a single reference to varying lengths). This is the resulting pattern:
There’s a step at around 50, 100, 200 and 400 Mbp. I had to stop at a reference size of 570 Mbp because I did this on my machine at home, which doesn’t have enough RAM for longer references.
The pattern could be explained by an std::vector that is being appended to (since an std::vector typically doubles its capacity when it needs to grow). There are a lot of calls to std::vector.reserve() already; perhaps one of them needs to be adjusted or we need one more.
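To illustrate the hypothesis, here is a minimal sketch (not strobealign code; count_reallocations is a hypothetical helper) that counts how often a std::vector changes its capacity while elements are appended, with and without an up-front reserve(). The exact growth factor is implementation-defined, so the sketch only counts reallocations rather than assuming a factor of two. Each reallocation briefly keeps the old and the new buffer alive at the same time, which is what would show up as a step in peak memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only.
// Appends n elements to a vector and counts how often its capacity changes,
// i.e. how often the vector had to reallocate its storage.
// If reserve_upfront is true, space for all n elements is reserved first.
std::size_t count_reallocations(std::size_t n, bool reserve_upfront) {
    std::vector<int> v;
    if (reserve_upfront) {
        v.reserve(n);  // one allocation, no growth steps afterwards
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            // Capacity changed, so the storage was reallocated.
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

Calling count_reallocations(n, true) should report no reallocations during the appends, while count_reallocations(n, false) reports one per growth step.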
Ok, it’s the mers_index (hash map of type robin_hood). The mers_index.load_factor() is 0.77 at 90 Mbp and then goes down to 0.42 for 100 Mbp. I’m not sure, but does the robin_hood implementation only allow hash table sizes that are powers of two? Then there’s little we can do.
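The numbers are at least consistent with a power-of-two table: if the number of entries grows linearly with the reference length, going from 90 to 100 Mbp would raise the load factor from 0.77 to about 0.77 * 100 / 90 ≈ 0.86, and a doubling of the table then halves it to about 0.43. Below is a small sketch of that model. It is not robin_hood's actual code; the function names, the maximum load factor of 0.8 and the scaled-down entry counts are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model, not robin_hood's implementation: the table capacity is
// the smallest power of two that keeps entries / capacity <= max_load.
std::size_t power_of_two_capacity(std::size_t entries, double max_load) {
    assert(max_load > 0.0 && max_load <= 1.0);
    std::size_t capacity = 1;
    while (static_cast<double>(entries) > max_load * static_cast<double>(capacity)) {
        capacity *= 2;  // grow by doubling until the load limit is satisfied
    }
    return capacity;
}

// Load factor that results from the model above.
double load_factor(std::size_t entries, double max_load) {
    const std::size_t capacity = power_of_two_capacity(entries, max_load);
    return static_cast<double>(entries) / static_cast<double>(capacity);
}
```

With an assumed limit of 0.8, 788 entries fit into 1024 slots (load factor about 0.77), whereas 876 entries (788 scaled by 100/90) force 2048 slots and a load factor of about 0.43, close to the observed 0.42.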
Agreed with your last comment.
The initial allocation of the flat_vector is based on a prediction of the number of seeds that will be created from the reference. The prediction is roughly given by seed_thinning_factor * ref_length. Therefore, the initial memory allocation should be fairly accurate.
I don't know about the internals of the robin_hood hash table. It is somewhat memory-hungry, and I read somewhere that 2/3 of the memory is consumed by storing 'internal nodes'. The hash table is indeed the most memory-consuming data structure in strobealign. This hash table benchmark from 2022 (found by @johan-gson) may suggest viable alternatives, if any.