lambda icon indicating copy to clipboard operation
lambda copied to clipboard

Saving more space during construction.

Open SGSSGene opened this issue 2 years ago • 3 comments

This is an addition to PR #211 .

It replace another relabeling data structure and replaces it with a binary search on the accumulated sizes of the sequences. The created index doesn't change (creates identical binary).

Memory: for 20GB input and sampling Rate of 5, I am expecting to additionally save 32GB of memory during construction. Runtime: This is very hard to predict. Purely theoretically we replace O(n) with O(n * log(m)), where m is the number of sequences. In my test (386MB input) I have seen a runtime reduction of ~10%.

Update: There was another structure that I could reuse, which would be an additional 32GB of savings. This should lead in a total of 64GB less memory...With this lambda3 might be the same in memory usage to lambda2

SGSSGene avatar May 09 '23 08:05 SGSSGene

Runtime: This is very hard to predict. Purely theoretically we replace O(n) with O(n * log(m)), where m is the number of sequences. In my test (386MB input) I have seen a runtime reduction of ~10%.

Is this a change that affects the construction or also the search?

edit: I assume it is just for construction, because the on-disk format doesn't change..?

h-2 avatar May 10 '23 12:05 h-2

I did some tests and RAM decreased to 170GB from previous 214 GB, but the runtime increased again from previous 40 minutes to now 1:10h. I think we probably need some more tests here and therefore might need to push this a bit back due to our current benchmarking efforts.

sarahet avatar May 11 '23 07:05 sarahet

This should produce an identical index, resulting in no impact of the search. The construction itself will have a different runtime, which is hard to predict. An increase of 30min is surprising, this would mean that the majority time is spend outside of the suffix array construction. I will investigate with some larger data and see how it behaves.

SGSSGene avatar May 11 '23 14:05 SGSSGene