rhsort Would you be willing to benchmark agains Julia's default sorting algorithm?

@jariji shared the existence of rhsort with me here https://github.com/LilithHafner/InterLanguageSortingComparisons/issues/10. I ran the benchmarks in gcc bench.c -o bench && ./bench and what I think are equivalent benchmarks in Julia. I found that Julia's default sorting is 2x faster than rhsort in that particular case. If you're willing I'd appreciate your confirmation to refutation of those results. I benchmarked Julia's sorting with this code snippet that can be pasted into the Julia REPL:

using BenchmarkTools, Random, Printf
for n in 10 .^ (3:6)
    v = rand(Int32, n)
    result = @benchmark sort($v) setup=(rand!($v)) evals=1 gctrial=false samples=3e7/n
    @printf "Testing size %8i: best: %6.3f avg: %6.3f ns/v\n" n minimum(result.times)/n mean(result.times)/n
end

Jun 09 '23 16:06 LilithHafner

Thanks for reaching out! I'm happy to work with you to get accurate benchmarks and I'll see if I can get the Julia code running a little later. Note that SingeliSort is a newer effort that includes RH sort for 4-byte integers. There's some stuff I'd like to add but all the sorts should be correct (not well tested). See the exports at the bottom of compiled sort.c; rhsort32 is also provided and is the same Robin Hood algorithm with better merging.

I don't see an -O3 in your compilation, which makes nearly a 10x difference for me at the smaller sizes. A friend did timings on an M2: plot linked here shows performance slightly better than your Julia benchmarks for 1e3 and 1e4.

Jun 09 '23 16:06 mlochbaum

I don't see an -O3 in your compilation

Thanks! That bridges the gap, they now trade off depending on input size on this benchmark with this faster for smaller sizes and Julia faster for larger sizes. Sorting mid-sized arrays is a weakness of Julia's sorting algorithms, and I'm glad to see an algorithm that might be faster in that case.

x@x rhsort % gcc -O3 bench.c -o bench && ./bench && git rev-parse HEAD
In file included from bench.c:80:
./rhsort.c:170:25: warning: while loop has empty body [-Wempty-body]
  while (aux[--sz] == s); sz++;
                        ^
./rhsort.c:170:25: note: put the semicolon on a separate line to silence this warning
1 warning generated.
Sorting random 4-byte integers: rhsort
Testing size     1000: best:  3.000 avg:  4.213 ns/v
Testing size    10000: best:  3.200 avg:  3.443 ns/v
Testing size   100000: best:  4.460 avg:  4.730 ns/v
Testing size  1000000: best:  5.804 avg:  6.521 ns/v
944ad9dc4b8f0d16dcdb471b9eaf99856c3fd0f0
x@x rhsort % julia -e 'using BenchmarkTools, Random, Printf, InteractiveUtils
       for n in 10 .^ (3:6)
           v = rand(Int32, n)
           result = @benchmark sort($v) setup=(rand!($v)) evals=1 gctrial=false samples=3e7/n
           @printf "Testing size %8i: best: %6.3f avg: %6.3f ns/v\n" n minimum(result.times)/n mean(result.times)/n
       end
       versioninfo()'
Testing size     1000: best:  5.208 avg:  7.033 ns/v
Testing size    10000: best:  4.517 avg:  5.713 ns/v
Testing size   100000: best:  4.608 avg:  5.317 ns/v
Testing size  1000000: best:  4.972 avg:  5.342 ns/v
Julia Version 1.9.1
Commit 147bdf428cd (2023-06-07 08:27 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

Jun 09 '23 17:06 LilithHafner

Here's a run on a rather slow CPU; can also confirm that Julia 1.9.0 is >2x faster than 1.8.5!

$ gcc -O3 bench.c -o bench && ./bench
Sorting random 4-byte integers: rhsort
Testing size     1000: best:  4.819 avg:  5.258 ns/v
Testing size    10000: best:  6.198 avg:  6.550 ns/v
Testing size   100000: best:  8.861 avg:  9.346 ns/v
Testing size  1000000: best: 26.642 avg: 26.866 ns/v

$ julia -e 'using BenchmarkTools, Random, Printf, InteractiveUtils
       for n in 10 .^ (3:6)
           v = rand(Int32, n)
           result = @benchmark sort($v) setup=(rand!($v)) evals=1 gctrial=false samples=3e7/n
           @printf "Testing size %8i: best: %6.3f avg: %6.3f ns/v\n" n minimum(result.times)/n mean(result.times)/n
       end
       versioninfo()'
Testing size     1000: best: 14.962 avg: 19.190 ns/v
Testing size    10000: best: 13.678 avg: 16.304 ns/v
Testing size   100000: best: 14.307 avg: 16.554 ns/v
Testing size  1000000: best: 17.982 avg: 19.801 ns/v
Julia Version 1.9.0
Commit 8e63055292* (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 4 virtual cores

Jun 09 '23 17:06 mlochbaum

The idea with SingeliSort is to use a top-down quicksort, and switch over to Robin Hood when the array gets small enough and a sample indicates that the distribution is even enough (several other base cases cover situations where this doesn't happen). It's intended to eventually be the sorting algorithm in CBQN implementing BQN, although I don't know when that'll happen. There are some logistical issues with using the repository because there are components that should be shared with the rest of the interpreter instead of duplicating them.

I just pushed an updated compiled/sort.c so that's on the latest code (although there's hardly any functional change from the last recompile). Should be possible to run benchmarks from a fresh clone like so:

$ ln -s compiled/sort.c .
$ gcc -O3 bench.c -o bench && ./bench
Sorting random 4-byte integers: singelisort
Testing size     1000: best:  5.317 avg:  6.343 ns/v
Testing size    10000: best:  6.460 avg:  7.682 ns/v
Testing size   100000: best:  9.762 avg: 10.525 ns/v
Testing size  1000000: best: 13.279 avg: 13.536 ns/v

And a saved graph that compares to RH: rand

Jun 09 '23 17:06 mlochbaum

I see the runtime spike for rhsort a little later (somewhere between 1e6 and 1e7) so for me rhsort is faster all the way up to 1e6, but I do get results that are pretty similar to that figure.

x@x rhsort % git diff
diff --git a/bench.c b/bench.c
index 6450879..515176c 100644
--- a/bench.c
+++ b/bench.c
@@ -144,7 +144,7 @@ int main(int argc, char **argv) {
   printf("Sorting %s: %s%s%s\n", datadesc, bravery, quadness, sortname);
   // Command-line arguments are max or min,max
   // Inclusive range, with sizes 10^n tested
-  U min=3, max=6; int ls=0;
+  U min=3, max=7; int ls=0;
   if (argc>1) {
     ls = argv[1][0]=='l';
     if (ls) {
x@x rhsort % gcc -O3 bench.c -o bench && ./bench
In file included from bench.c:80:
./rhsort.c:170:25: warning: while loop has empty body [-Wempty-body]
  while (aux[--sz] == s); sz++;
                        ^
./rhsort.c:170:25: note: put the semicolon on a separate line to silence this warning
1 warning generated.
Sorting random 4-byte integers: rhsort
Testing size     1000: best:  5.000 avg:  7.816 ns/v
Testing size    10000: best:  4.300 avg:  5.189 ns/v
Testing size   100000: best:  5.180 avg:  6.215 ns/v
Testing size  1000000: best:  5.824 avg:  6.499 ns/v
Testing size 10000000: best: 10.814 avg: 10.814 ns/v
x@x rhsort % cd ../SingeliSort 
x@x SingeliSort % git diff         
diff --git a/bench.c b/bench.c
index ff11846..4d1391a 100644
--- a/bench.c
+++ b/bench.c
@@ -53,7 +53,7 @@ int main(int argc, char **argv) {
   printf("Sorting %s: %s\n", datadesc, sortname);
   // Command-line arguments are max or min,max
   // Inclusive range, with sizes 10^n tested
-  U min=3, max=6; int ls=0;
+  U min=3, max=7; int ls=0;
   if (argc>1) {
     ls = argv[1][0]=='l';
     if (ls) {
x@x SingeliSort % gcc -O3 bench.c -o bench && ./bench
Sorting random 4-byte integers: singelisort
Testing size     1000: best:  5.000 avg:  6.119 ns/v
Testing size    10000: best:  3.700 avg:  4.469 ns/v
Testing size   100000: best:  5.310 avg:  5.725 ns/v
Testing size  1000000: best:  6.361 avg:  6.416 ns/v
Testing size 10000000: best:  7.700 avg:  7.700 ns/v
x@x SingeliSort % julia -e 'using BenchmarkTools, Random, Printf, InteractiveUtils
       for n in 10 .^ (3:7)
           v = rand(Int32, n)
           result = @benchmark sort!($v) setup=(rand!($v)) evals=1 gctrial=false samples=3e7/n
           @printf "Testing size %8i: best: %6.3f avg: %6.3f ns/v\n" n minimum(result.times)/n mean(result.times)/n
       end
       versioninfo()'
Testing size     1000: best:  5.041 avg:  6.171 ns/v
Testing size    10000: best:  4.533 avg:  5.296 ns/v
Testing size   100000: best:  4.548 avg:  5.091 ns/v
Testing size  1000000: best:  4.907 avg:  5.294 ns/v
Testing size 10000000: best:  5.747 avg:  6.427 ns/v
Julia Version 1.9.1
Commit 147bdf428cd (2023-06-07 08:27 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M2
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

I expect that it is possible to significantly outperform Julia on massive arrays as Julia uses an LSB radix sort with bad cache locality while a divide and conquer top down approach (MSB-radix sort) followed by LSB radix sort for the base case is better in that case.

These benchmarks make me believe that robin sort is plausibly better than radix sort as a base case—though I'd need to do benchmarks on data that is not drawn from a uniform distribution to know for sure.

Do you know why rhsort has that runtime spike at higher sizes?

Jun 09 '23 18:06 LilithHafner

./bench 7 will also run up to 1e7, and you can do a finer benchmark with ./bench l 40.

Since RH on random data does random writes, the spike comes from running out of L3 cache. The M2's L3 is huge so it can sort bigger arrays.

256-way LSB radix (CBQN's current 4-byte sorting algorithm, incidentally) has very good locality: it reads from one region and writes to 256 at a time, which cache can easily keep up with. Unless the spacing causes it to have set associativity issues as described here. I guess I don't know what happens when it stops fitting in RAM but I've never seen an issue with large sizes in radix sort.

The issue with using radix sort as a base case is more that it ignores any work that's already been done. SingeliSort does use a 2-byte radix sort when a partition falls into 2-byte range, implemented here. A nice thing about this is that it only needs to keep 2 bytes per element so it's mostly in place. It's a little faster than RH and it'll end up being used on inputs where the range is small enough relative to the length, which will always be true for large enough arrays.

Jun 09 '23 18:06 mlochbaum

You may also be interested in CBQN's current radix sort? For me it's coming up substantially faster than Julia, although surely this is somewhat architecture-dependent. Looking at radix_sort in sort.jl, I do see that it's doing a separate counting pass before each round of radix sorting, while doing all the counts and sums in one pass as in ska_sort is faster.

Building should be quick: clone CBQN; run $ make o3n; use ./BQN instead of bqn below. There's AVX2 code for the sums which matters for smaller arrays (hundreds) but nothing for ARM.

Here are my timings, showing log_10(n) and average time in ns/v. It uses random numbers in [0,2e9) instead of full-range but that's not going to change anything.

$ bqn -e "{n←10⋆𝕩 ⋄ •Show 𝕩 ⋈ n÷˜1e9× (3e7÷n) ∧•_timed n •rand.Range 2e9}¨ 3↓↕8"
⟨ 3 6.4459191 ⟩
⟨ 4 8.3309357 ⟩
⟨ 5 8.956033933333334 ⟩
⟨ 6 10.101803833333333 ⟩
⟨ 7 11.0162912 ⟩

Source is very ugly (blame C): 32-bit sort and radix sum. The Singeli version is laid out a lot better if the unusual language isn't an issue.

Jun 09 '23 19:06 mlochbaum

256-way LSB radix (CBQN's current 4-byte sorting algorithm, incidentally) has very good locality

Good point, as long as the hardware supports that many read/write heads efficiently. That explains some performance characteristics I've observed.

I guess I don't know what happens when it stops fitting in RAM

Yeah, that's the only time I've come across issues. When sorting something larger than the amount of available physical memory, the OS can support other algorithms with swap memory more effectively than radix sort.

Looking at radix_sort in sort.jl, I do see that it's doing a separate counting pass before each round of radix sorting, while doing all the counts and sums in one pass as in ska_sort is faster.

When I originally wrote that code I experimented with counting ahead of time and per-pass and found per-pass to be faster. That could have been due to using 64-bit count variables (I also saw no speedup switching from 64-bit count variables to narrower bit-widths). However, I was pretty new to Julia at that time, so I may have made some oversights in that analysis. Plus, the existence of this faster implementation in CQBN means I must be missing something.

These performance characteristics are very strange

   {n←10⋆𝕩 ⋄ ((3e7÷n) ∧•_timed ∧ n •rand.Range 2e10) × 1e9 ÷ n }¨ 3↓↕8 # presorted 64-bit integers
⟨ 2.6931666666666665 1.6032666666666666 1.5909 1.627 1.9460333333333333 ⟩
   {n←10⋆𝕩 ⋄ ((3e7÷n) ∧•_timed n •rand.Range 2e9) × 1e9 ÷ n }¨ 3↓↕8 # random 32-bit integers
⟨ 2.8407999999999998 2.5551666666666666 2.7207000000000003 2.8192666666666666 4.058433333333333 ⟩
   {n←10⋆𝕩 ⋄ ((3e7÷n) ∧•_timed ∧ n •rand.Range 2e9) × 1e9 ÷ n }¨ 3↓↕8 # presorted 32-bit integers
⟨ 4.8028 3.5467 3.6523333333333334 3.7621333333333333 4.7351 ⟩
   {n←10⋆𝕩 ⋄ ((3e7÷n) ∧•_timed n •rand.Range 2e10) × 1e9 ÷ n }¨ 3↓↕8 # random 64-bit integers
⟨ 50.302800000000005 83.17146666666666 140.1796 139.31573333333333 158.00593333333333 ⟩
   {n←10⋆𝕩 ⋄ ((3e7÷n) ∧•_timed ∧ n •rand.Range 2e10) × 1e9 ÷ n }¨ 3↓↕8 # presorted 64-bit integers
⟨ 1.6696666666666666 1.5579 1.5874666666666666 1.6270333333333333 1.7827333333333333 ⟩

however, it seems to me that the bqn implementation for random 32-bit integers is outstanding:

x@x SingeliSort % gcc bench.c -O3 && ./bench 8                                    
Sorting random 4-byte integers: singelisort
Testing size     1000: best:  3.000 avg:  3.976 ns/v
Testing size    10000: best:  2.900 avg:  3.387 ns/v
Testing size   100000: best:  5.000 avg:  5.219 ns/v
Testing size  1000000: best:  6.355 avg:  6.385 ns/v
Testing size 10000000: best:  7.664 avg:  7.664 ns/v
Testing size 100000000: best:  8.959 avg:  8.959 ns/v
x@x SingeliSort % julia -e 'using BenchmarkTools, Random, Printf                  
       for n in 10 .^ (3:8)
           v0 = rand(Int32(1):Int32(2e9), n)
           v = copy(v0)
           result = @benchmark sort!($v) setup=(copyto!($v, $v0)) evals=1 gctrial=false samples=2e8/n
           @printf "Testing size %8i: best: %6.3f avg: %6.3f ns/v\n" n minimum(result.times)/n mean(result.times)/n
       end'
Testing size     1000: best:  4.958 avg:  5.883 ns/v
Testing size    10000: best:  4.537 avg:  5.021 ns/v
Testing size   100000: best:  4.572 avg:  4.978 ns/v
Testing size  1000000: best:  4.871 avg:  5.118 ns/v
Testing size 10000000: best:  5.777 avg:  5.922 ns/v
Testing size 100000000: best:  6.270 avg:  6.333 ns/v
x@x SingeliSort % bqn
   {n←10⋆𝕩 ⋄ ((2e8÷n) ∧•_timed n •rand.Range 2e9) × 1e9 ÷ n }¨ 3↓↕9 # random 32-bit integers
⟨ 2.98448 2.624125 2.8187550000000003 2.7720599999999997 3.9261299999999997 6.192765 ⟩

Pardon my delay, it can take me hours to pick up a new language /h

Jun 09 '23 20:06 LilithHafner

CBQN doesn't have 64-bit ints: BQN specifies a single numeric type, which is double-precision floats for most implementations. So 32-bit ints and smaller types are an optimization. Like many things •rand.Range uses the smallest type that fits (and •internal.Type returns the storage type of its argument). If it's larger than 32-bit range you'll get floats, and for these we're currently just using the generic case which is stock Timsort. Plan is to convert 64-bit floats to integers and then use integer sorting but I think we'll only do that when moving to SingeliSort.

It's also a bit embarrassing that we're not using the sortedness flags to cheat on a pre-sorted argument, so expect that to change some day.

I don't think a different count type changes anything, except at small sizes, where zeroing the count array and doing sums is the main cost (hopefully Julia's cumsum is vectorized?). The important part of counting ahead of time is that you read one value and write four counts. That needs to be unrolled so I could imagine some trickiness in getting Julia to compile it right.

That's an annoying amount of performance we'll be giving up for adaptivity with SingeliSort, in the >1e5 range. No idea what to do about it though, as radix sort is all-or-nothing. Julia being able to provide multiple algorithms is nice for that.

Jun 09 '23 21:06 mlochbaum