Saving more space during construction.
This is an addition to PR #211.
It removes another relabeling data structure and replaces it with a binary search on the accumulated sizes of the sequences. The created index doesn't change (it creates an identical binary).
Memory: for a 20GB input and a sampling rate of 5, I expect to save an additional 32GB of memory during construction.
Runtime: This is very hard to predict. Purely theoretically, we replace O(n) with O(n * log(m)), where m is the number of sequences. In my test (386MB input) I saw a runtime reduction of ~10%.
Update: There was another structure that I could reuse, which gives an additional 32GB of savings. This should lead to a total of 64GB less memory. With this, lambda3 might match lambda2 in memory usage.
Is this a change that affects only the construction, or also the search?
edit: I assume it is just for construction, because the on-disk format doesn't change?
I did some tests and RAM decreased from the previous 214GB to 170GB, but the runtime increased from the previous 40 minutes to now 1:10h. I think we probably need some more tests here and therefore might need to push this back a bit due to our current benchmarking efforts.
This should produce an identical index, so there is no impact on the search. The construction itself will have a different runtime, which is hard to predict. An increase of 30min is surprising; it would mean that the majority of the time is spent outside of the suffix array construction. I will investigate with some larger data and see how it behaves.