simple-saca
Use sort_by_cached_key
For small contexts (up to 496bp), it seems that `sort_by_cached_key`, which reads
and stores the context up front, is up to 3 times faster than `sort_by` with a
custom comparator that has to do the 124bp aligned reads for each comparison.
Note that this compares `[[u64; 4]; CTX]` one `u64` at a time. Using SIMD for
this (by giving `Key` a custom `PartialOrd` implementation) may be more efficient.
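To make the trade-off concrete, here is a minimal standalone sketch of the two approaches, not the crate's actual code: `CTX`, `Key`, and `context_of` are illustrative stand-ins for the real context layout and read.

```rust
// Illustrative sketch: sorting suffix indices by a fixed-size context key.
// `CTX`, `Key`, and `context_of` are assumptions, not simple-saca's definitions.

const CTX: usize = 2; // number of [u64; 4] chunks per key, illustrative

type Key = [[u64; 4]; CTX];

// Stand-in for reading the packed context of suffix `i` from the text.
fn context_of(text: &[u64], i: usize) -> Key {
    let mut key = [[0u64; 4]; CTX];
    for c in 0..CTX {
        for j in 0..4 {
            key[c][j] = *text.get(i + c * 4 + j).unwrap_or(&0);
        }
    }
    key
}

fn main() {
    let text: Vec<u64> = (0..32).map(|x| (x * 2654435761u64) % 97).collect();
    let mut a: Vec<usize> = (0..text.len()).collect();
    let mut b = a.clone();

    // Slow: re-reads both contexts on every comparison.
    a.sort_by(|&x, &y| context_of(&text, x).cmp(&context_of(&text, y)));

    // Fast: reads and stores each context once, then sorts by the cached key.
    b.sort_by_cached_key(|&x| context_of(&text, x));

    assert_eq!(a, b); // both are stable sorts, so the results agree
}
```

Both sorts are stable, so they produce identical output; only the number of context reads differs.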
Some numbers for the sorting time on my 6-core machine. (Excluding the ~50s spent on everything else.)
`sort_by`, k=10
- context 124: 145s
- context 248: 147s
- context 496: 150s
- context 992: 158s
`sort_by_cached_key`, k=10
- context 124: 44s
- context 248: 56s
- context 496: 89s
- context 992: 162s
For short context, the total time goes down from ~200s to ~100s, so 2x overall speedup :)
Wow thanks! Testing on a 32 vCPU AWS machine, the comparison sort time is reduced to 16s and the overall suffix array construction time is reduced to 33s!
I'll probably make this optional to allow trading off between memory usage and run time.
I also tried unstable sort again, and it's 40% faster than stable sort on the AWS machine!
I also tried glidesort and got a similar speedup, probably due to the pattern-defeating techniques used in both sorting algorithms.
I pushed some more commits.
- `sort_by_cached_key` sorts pairs of `(key, idx in vec to be sorted)`, but in this case it's more efficient to directly sort `(key, suffix array idx)`, so that we don't have to apply the permutation to the slice-to-be-sorted at the end. To fix this, I inlined `sort_by_cached_key` and modified it a bit.
- Since unstable sort is OK, we can simplify the comparator to only check `key`, not `idx`.
- I made the allocation thread-local. I don't think this actually matters too much, but feels nice. (OTOH, they now keep the size of the largest bucket that was sorted, so this does increase memory usage slightly, and the code is also a bit more complex.)
- The sorting was wrong (see also #4). I changed it to sort using SIMD methods, which is the same speed but gives correct results. This does mean that SIMD comparison doesn't actually give a speedup, so instead of doing the reverse packing, simple forward packing with a comparison per `u64` may be simpler and just as fast.
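A hedged sketch of the first two points, sorting `(key, index)` pairs directly with an unstable sort keyed on the key alone, so no permutation has to be applied at the end. `key_of` and the 8-byte big-endian key are illustrative assumptions, not the crate's actual key layout.

```rust
// Sketch of the "inlined sort_by_cached_key" idea: cache each key once,
// unstable-sort (key, suffix index) pairs by key only, then read the
// suffix array straight out of the pairs. `key_of` is a stand-in for
// the real packed-context read.

fn key_of(text: &[u8], i: usize) -> u64 {
    // Pack up to 8 bytes of context big-endian, so u64 order = lexicographic order.
    let mut k = 0u64;
    for j in 0..8 {
        k = (k << 8) | *text.get(i + j).unwrap_or(&0) as u64;
    }
    k
}

fn sorted_suffix_indices(text: &[u8]) -> Vec<u32> {
    // One pass to cache keys; this buffer could be thread-local as in the text.
    let mut pairs: Vec<(u64, u32)> = (0..text.len() as u32)
        .map(|i| (key_of(text, i as usize), i))
        .collect();
    // Unstable sort is OK, so compare the key only, not the index.
    pairs.sort_unstable_by_key(|&(k, _)| k);
    pairs.into_iter().map(|(_, i)| i).collect()
}

fn main() {
    // With 8-byte keys and a 6-byte text, the keys fully determine suffix order.
    let sa = sorted_suffix_indices(b"banana");
    assert_eq!(sa, vec![5, 3, 1, 0, 4, 2]);
}
```

The pairs are `(u64, u32)`, so the extra memory over sorting indices alone is the cached key itself, which is the memory/run-time trade-off mentioned earlier.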
New timings:
- context 124: 36s (from 44s for `sort_by_cached_key`)
- context 248: 53s (from 56s)
- context 496: 92s (from 89s :thinking:)
- context 992: 200s (from 162s :thinking:; probably CPU throttling)
Did some statistics on the LCP of consecutive suffix array elements. Looking at this, bucketing by 9 elements twice, followed by comparison-based sorting, may be promising. That way most of the elements with LCP <= 18 are already sorted, and we only need comparisons for the last ~20% of elements.
Cumulative distribution of LCP length:
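The two-pass bucketing idea might look roughly like this sketch. Assumptions that are not from the thread: 2-bit DNA codes and a hypothetical `prefix18` helper; also, a real implementation would use counting sort per 18-bit level, while this sketch just folds both prefixes into one comparison key to show the ordering.

```rust
// Sketch: bucket by the first 9 bases, then the next 9, then fall back to a
// full comparison only to break ties, i.e. only for suffixes with LCP > 18.

fn code(b: u8) -> u64 {
    // Hypothetical 2-bit DNA encoding, monotone in ASCII order.
    match b { b'A' => 0, b'C' => 1, b'G' => 2, b'T' => 3, _ => 0 }
}

// 18-bit packed code of bases text[i..i+9] ('A'-padded past the end).
fn prefix18(text: &[u8], i: usize) -> u64 {
    (0..9).fold(0u64, |k, j| (k << 2) | code(*text.get(i + j).unwrap_or(&b'A')))
}

fn two_level_sorted(text: &[u8]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..text.len()).collect();
    idx.sort_unstable_by(|&x, &y| {
        // Two bucketing levels folded into one 36-bit key for illustration.
        let kx = (prefix18(text, x) << 18) | prefix18(text, x + 9);
        let ky = (prefix18(text, y) << 18) | prefix18(text, y + 9);
        // Full comparison only when the first 18 bases agree.
        kx.cmp(&ky).then_with(|| text[x..].cmp(&text[y..]))
    });
    idx
}

fn main() {
    let text = b"ACGTACGTACGTTTTGGGACGT";
    let idx = two_level_sorted(text);
    // Matches a plain comparison sort of the suffixes.
    let mut reference: Vec<usize> = (0..text.len()).collect();
    reference.sort_unstable_by(|&x, &y| text[x..].cmp(&text[y..]));
    assert_eq!(idx, reference);
}
```

Per the LCP statistics above, the tie-breaking comparison would only run for the ~20% of neighboring suffixes with LCP > 18.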
SIMD comparison might still be faster for longer contexts? Also, you still need to reverse pack with `u64` too, if you want to use the comparison operator on the `u64`s directly (e.g., `a > b`).
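To illustrate why the packing direction matters (a minimal standalone example, not the crate's code): comparing packed bases with `u64` operators only matches lexicographic order when the earlier base lands in the more significant bits, i.e. big-endian packing, which is "reversed" relative to little-endian memory order.

```rust
// Big-endian packing makes integer comparison agree with byte-wise
// lexicographic comparison; little-endian packing does not.

fn pack_be(bases: &[u8; 8]) -> u64 {
    u64::from_be_bytes(*bases) // first base -> most significant byte
}

fn pack_le(bases: &[u8; 8]) -> u64 {
    u64::from_le_bytes(*bases) // first base -> least significant byte
}

fn main() {
    let a = b"ACTAAAAA";
    let b = b"AGAAAAAA"; // differs at position 1: 'C' < 'G'
    assert!(a < b);                   // lexicographic byte order
    assert!(pack_be(a) < pack_be(b)); // big-endian agrees
    assert!(pack_le(a) > pack_le(b)); // little-endian flips the order
}
```

In the little-endian case the later byte `'T' > 'A'` at position 2 outweighs the earlier `'C' < 'G'`, which is exactly the bug this avoids.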