
Use sort_by_cached_key

RagnarGrootKoerkamp opened this issue 1 year ago • 5 comments

For small contexts (up to 496bp), it seems that sort_by_cached_key, which reads and stores the context up front, is up to 3 times faster than sort_by with the custom comparator, which has to do the 124bp aligned reads on every comparison.

Note that this compares [[u64; 4]; CTX] one u64 at a time. Using SIMD for this (by giving Key a custom PartialOrd implementation) may be more efficient.
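To make the difference concrete, here is a minimal Rust sketch of the two approaches. This is not the simple-saca source: `CTX_WORDS`, `key_at`, and `sort_suffixes` are illustrative names, and the key here covers 32 bytes of plain text rather than the packed 124bp contexts used in the crate.

```rust
// Hypothetical context width: 4 x u64 (the crate uses [[u64; 4]; CTX]).
const CTX_WORDS: usize = 4;

// Read the context starting at `pos` as an array of big-endian-packed words,
// so that comparing the arrays word by word matches lexicographic byte order.
// Positions past the end of the text pad with 0, which sorts before any byte.
fn key_at(text: &[u8], pos: usize) -> [u64; CTX_WORDS] {
    let mut key = [0u64; CTX_WORDS];
    for (w, chunk) in key.iter_mut().zip(text[pos..].chunks(8)) {
        let mut bytes = [0u8; 8];
        bytes[..chunk.len()].copy_from_slice(chunk);
        *w = u64::from_be_bytes(bytes); // first byte lands in the high bits
    }
    key
}

fn sort_suffixes(text: &[u8], suffixes: &mut [usize]) {
    // sort_by recomputes both keys on every comparison:
    //   suffixes.sort_by(|&a, &b| key_at(text, a).cmp(&key_at(text, b)));
    // sort_by_cached_key computes each key exactly once (O(n) key reads
    // instead of O(n log n)), at the cost of storing all the keys:
    suffixes.sort_by_cached_key(|&p| key_at(text, p));
}
```

The trade-off is exactly the one discussed below: the cached keys cost extra memory, but the comparator no longer touches the text at all.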

Some numbers for the sorting time on my 6-core machine. (Excluding the ~50s spent on everything else.)

sort_by, k=10

  • context 124: 145s
  • context 248: 147s
  • context 496: 150s
  • context 992: 158s

sort_by_cached_key, k=10

  • context 124: 44s
  • context 248: 56s
  • context 496: 89s
  • context 992: 162s

For short context, the total time goes down from ~200s to ~100s, so 2x overall speedup :)

RagnarGrootKoerkamp avatar Oct 27 '23 12:10 RagnarGrootKoerkamp

Wow thanks! Testing on a 32 vCPU AWS machine, the comparison sort time is reduced to 16s and the overall suffix array construction time is reduced to 33s!

I'll probably make this optional to allow trading off between memory usage and run time.

Daniel-Liu-c0deb0t avatar Oct 30 '23 04:10 Daniel-Liu-c0deb0t

I also tried unstable sort again, and it's 40% faster than stable sort on the AWS machine!

I also tried glidesort and got a similar speedup, probably due to the pattern-defeating techniques used in both sorting algorithms.

Daniel-Liu-c0deb0t avatar Oct 30 '23 04:10 Daniel-Liu-c0deb0t

I pushed some more commits.

  1. sort_by_cached_key sorts (key, index into the slice being sorted) pairs, but in this case it's more efficient to directly sort (key, suffix array idx) pairs, so that we don't have to apply the permutation to the slice-to-be-sorted at the end. To do this, I inlined sort_by_cached_key and modified it a bit.
  2. Since unstable sort is OK, we can simplify the comparator to only check key, not idx.
  3. I made the allocation thread-local. I don't think this actually matters too much, but feels nice. (OTOH, they now keep the size of the largest bucket that was sorted so this does increase memory usage slightly, and the code is also a bit more complex.)
  4. The sorting was wrong (see also #4). I changed it to sorting using SIMD methods, which is the same speed but gives correct results. This does mean that SIMD comparison doesn't actually give a speedup, so instead of the reverse packing, I think forward packing with a per-u64 comparison may be simpler and just as fast.
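The idea in points 1 and 2 can be sketched roughly as follows (illustrative only, not the actual commit; `key_at` is a stand-in for however the context key is built):

```rust
const CTX_WORDS: usize = 4; // hypothetical context width: 4 x u64

// Illustrative key: the context starting at `pos`, packed big-endian so
// that array comparison matches lexicographic byte order.
fn key_at(text: &[u8], pos: usize) -> [u64; CTX_WORDS] {
    let mut key = [0u64; CTX_WORDS];
    for (w, chunk) in key.iter_mut().zip(text[pos..].chunks(8)) {
        let mut bytes = [0u8; 8];
        bytes[..chunk.len()].copy_from_slice(chunk);
        *w = u64::from_be_bytes(bytes);
    }
    key
}

fn sort_suffixes_direct(text: &[u8], suffixes: &mut [usize]) {
    // Point 1: build (key, suffix position) pairs directly, instead of the
    // (key, slice index) pairs std's sort_by_cached_key uses internally.
    let mut pairs: Vec<([u64; CTX_WORDS], usize)> = suffixes
        .iter()
        .map(|&p| (key_at(text, p), p))
        .collect();
    // Point 2: with an unstable sort we can compare the key alone and
    // ignore the index; equal keys may land in any order.
    pairs.sort_unstable_by(|a, b| a.0.cmp(&b.0));
    // Write the positions straight back: no permutation pass at the end.
    for (slot, (_key, p)) in suffixes.iter_mut().zip(pairs) {
        *slot = p;
    }
}
```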

New timings:

  • context 124: 36s (from 44s for sort_by_cached_key)
  • context 248: 53s (from 56s)
  • context 496: 92s (from 89s :thinking: )
  • context 992: 200s (from 162s :thinking:; probably CPU throttling)

RagnarGrootKoerkamp avatar Oct 30 '23 11:10 RagnarGrootKoerkamp

I did some statistics on the LCPs of consecutive suffix array elements. Based on this, bucketing by 9 elements twice, followed by comparison-based sorting, may be promising: that way most of the elements with LCP<=18 are already sorted, and we only need comparisons for the last ~20% of elements.

Cumulative distribution of LCP length: [plot]
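A rough sketch of what such a scheme could look like, assuming a 2-bit DNA encoding. One sort on a 36-bit (18-base) prefix key stands in here for the two 9-element radix passes; only suffixes sharing their full 18-base prefix fall through to the comparison sort. All names are illustrative.

```rust
const PREFIX_BASES: usize = 18; // 2 passes x 9 elements

fn base_code(b: u8) -> u64 {
    match b {
        b'A' => 0,
        b'C' => 1,
        b'G' => 2,
        _ => 3, // 'T' (real code would validate the input)
    }
}

// First 18 bases packed 2 bits each, most significant first, so integer
// order matches base order. End-of-text pads with 0, which collides with
// 'A'; the fallback comparison sort below resolves those buckets correctly.
fn prefix_key(text: &[u8], pos: usize) -> u64 {
    let mut key = 0u64;
    for i in 0..PREFIX_BASES {
        key <<= 2;
        if let Some(&b) = text.get(pos + i) {
            key |= base_code(b);
        }
    }
    key
}

fn bucket_then_sort(text: &[u8], suffixes: &mut [usize]) {
    // "Bucketing" pass: group suffixes by their 18-base prefix.
    suffixes.sort_unstable_by_key(|&p| prefix_key(text, p));
    // Comparison sort only within runs that share a prefix key,
    // i.e. suffix pairs with LCP >= 18.
    let mut i = 0;
    while i < suffixes.len() {
        let k = prefix_key(text, suffixes[i]);
        let mut j = i + 1;
        while j < suffixes.len() && prefix_key(text, suffixes[j]) == k {
            j += 1;
        }
        suffixes[i..j].sort_unstable_by(|&a, &b| text[a..].cmp(&text[b..]));
        i = j;
    }
}
```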

RagnarGrootKoerkamp avatar Oct 30 '23 14:10 RagnarGrootKoerkamp

SIMD comparison might still be faster for longer contexts? Also, you still need to reverse pack with u64 too, if you want to use the comparison operator on the u64s directly (e.g., a > b).
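A small sketch of why the packing order matters for direct u64 comparison (my reading of "reverse pack": the first byte/base must end up in the most significant bits, which is reversed relative to little-endian memory order):

```rust
// Byte 0 becomes the MOST significant byte: `a < b` on the u64s then
// agrees with lexicographic order on the original bytes.
fn pack_be(bytes: &[u8; 8]) -> u64 {
    u64::from_be_bytes(*bytes)
}

// Byte 0 becomes the LEAST significant byte (little-endian memory order):
// comparing these u64s does NOT match lexicographic order in general.
fn pack_le(bytes: &[u8; 8]) -> u64 {
    u64::from_le_bytes(*bytes)
}
```

For example, `"AAAAAAAB"` sorts before `"BAAAAAAA"` lexicographically; the big-endian packing preserves that order under `<`, while the little-endian packing inverts it.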

Daniel-Liu-c0deb0t avatar Oct 31 '23 22:10 Daniel-Liu-c0deb0t