highway
vqsort repro by djb
Thanks for making me aware of https://mobile.twitter.com/hashbreaker/status/1533201734369103872, and to djb for working to reproduce our results. I propose to move discussion here.
timed Sorter() as ~8000 cycles for int32[256] (big chunk of code for a size-specific sorting network)
Section 3 describes vqsort's size-256 sorting network
Note that our sorting network size is actually 16 * elements_per_vector, i.e. 128 in this case (AVX2 vectors hold 8 int32 lanes, so 16 * 8 = 128). Thus what is being compared is djbsort's constant-time sorting network vs. our full quicksort with pivot sampling, partitioning, and then the sorting network.
According to LLVM-MCA, our 128-element network consists of 956 instructions with "total cycles" of 299 for "-mcpu=skylake" (438 for Haswell). Note that this does not include any front-end stalls or cache misses.
Our claim of 'outperforms' is also specific to larger array sizes (the paper provides results for 1M and 100M). I understand some use cases may be interested in smaller arrays.
Thanks to commenters for making us aware of vxsort, which was not published in the form of a paper. We will look into its performance soon.
The Jun 7 microbenchmark seems less relevant to our use cases. First, using wide vectors (AVX2 or AVX-512) for tiny problem sizes can be a pessimization at the system level due to license-based downclocking or P-state transitions, which will not be visible in microbenchmarks. To be clear: throttling is a non-issue with proper, sustained use of vector instructions for at least ~1 ms, after which the resulting time savings dwarf the transition cost. Given vqsort throughput of around 1 GB/s (AVX-512) and 0.6 GB/s (AVX2), the minimum input should be 1 MB (AVX-512) or 600 KiB (AVX2). Not even the rightmost point on the graph meets this criterion.
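For reference, a back-of-the-envelope version of that threshold, using only the throughput figures quoted above (the constant names are purely illustrative):

```cpp
// Rough check of the minimum useful input size, assuming the throughputs quoted
// above and ~1 ms of sustained vector use to amortize downclocking/P-state cost.
constexpr double kAvx512BytesPerSec = 1.0e9;   // ~1 GB/s
constexpr double kAvx2BytesPerSec   = 0.6e9;   // ~0.6 GB/s
constexpr double kMinVectorSeconds  = 1e-3;    // sustain wide vectors for ~1 ms
constexpr double kMinAvx512Bytes = kAvx512BytesPerSec * kMinVectorSeconds;  // ~1 MB
constexpr double kMinAvx2Bytes   = kAvx2BytesPerSec   * kMinVectorSeconds;  // ~600 KB
```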
Perhaps the author is aware of this and is interested in sorting in the context of other code that also uses wide vectors; that would be fine. Perhaps sorting a few numbers is required so often that it is indeed time-critical. In that case, one would not use a Quicksort. For other cases, this microbenchmark is not very informative.
Second, vqsort's pivot selection uses high-quality random sampling of 576 bytes of input followed by an iterated median-of-3. Of course this looks much slower for tiny input sizes such as N=256 than vxsort's single median-of-3. However, it saves us from drastic slowdowns on skewed input distributions; even when the heapsort fallback kicks in, that is going to be >10x slower.
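For illustration, a minimal sketch of that sampling strategy (the constants, RNG choice, and helper names are illustrative assumptions, not vqsort's actual implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Median of three keys.
uint32_t Median3(uint32_t a, uint32_t b, uint32_t c) {
  return std::max(std::min(a, b), std::min(c, std::max(a, b)));
}

// Sketch of "random sampling followed by iterated median-of-3": draw random
// samples (vqsort actually samples 576 bytes), then repeatedly reduce groups
// of three via Median3 until a single pivot remains (81 -> 27 -> 9 -> 3 -> 1).
uint32_t ChoosePivot(const uint32_t* keys, size_t n, std::mt19937& rng) {
  std::vector<uint32_t> samples(81);
  std::uniform_int_distribution<size_t> dist(0, n - 1);
  for (uint32_t& s : samples) s = keys[dist(rng)];
  for (size_t len = samples.size(); len > 1; len /= 3) {
    for (size_t i = 0; i < len / 3; ++i) {
      samples[i] = Median3(samples[3 * i], samples[3 * i + 1], samples[3 * i + 2]);
    }
  }
  return samples[0];
}
```

The point of the larger sample is robustness: a single median-of-3 is cheaper, but is more easily fooled by skewed input distributions.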
In current vqsort, the base case is (for AVX2) a size-128 sorting network, where vqsort looks slightly better than previous work... because it fully unrolls the code for that size. Can't fit many such sizes in instruction caches.
vqsort has only a single network; the unused merge steps for smaller network sizes are skipped via conditional branches.
I think "vxsort and djbsort are faster for large inputs" is not a claim that accompanies the benchmark.
I think a claim that accompanies the benchmark is that the small array sizes are the base cases for large array sizes, and that vqsort may not perform as well as vxsort or djbsort on small arrays (and that this is not entirely due to short-term differences attributable to license-based downclocking or P-state transitions), which means that vqsort could potentially be even faster than it already is for large inputs if it could solve base cases as fast as vxsort or djbsort.
Is this all incorrect?
Thanks.
> small array sizes are the base cases for large array sizes, and that vqsort may not perform as well as vxsort or djbsort on small arrays (and that this is not entirely due to short-term differences attributable to license-based downclocking or P-state transitions), which means that vqsort could potentially be even faster than it already is for large inputs if it could solve base cases as fast as vxsort or djbsort.
I understand your point as "making our sorting network faster would also benefit larger sizes". However, I have not yet seen evidence that it is slower than any alternative. I still haven't gotten to measuring vxsort, but djb's measurements showed that our SortingNetwork is indeed faster for 128 elements (its actual size for int32 and AVX2). That is also prior to the optimization in #744 inspired by this discussion.
There are indeed extra overheads in other parts of our code, but let's see what they are contributing:
- For > 128 elements, the actual quicksort starts (pivot sampling + partitioning). At the crossover point, this added overhead is generally going to be worse than a constant-time sorting network (djbsort), though the latter will increasingly underperform for larger inputs due to the network's higher asymptotic complexity.
- Our pivot sampling is larger and more robust than the simple median-of-3 used by vxsort. This does come at a cost, especially for medium input sizes (for which using wide vectors is a net loss unless the rest of the application is also using them). However, I disagree with the conclusion that stripping out this robust sampling would improve large-input performance: unbalanced partitions can lose far more than simpler sampling saves.
- We are very conservative with AVX2, using codegen that pessimistically assumes masked stores can still fault even if the mask is zero (AMD reserves the right to do so), and copying inputs into a buffer so that the sorting network can always load full vectors (see the sketch after this list). The benefit of this deliberately incurred cost is practical deployability: reducing the code size of the network, and avoiding crashes in case future AMD CPUs actually do fault (which seems unlikely, but imagine the cost if it does happen).
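As a concrete illustration of the padding mentioned in the last point, a minimal sketch (with std::sort standing in for the real vector sorting network; names and constants are illustrative, not the actual Highway code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>

constexpr size_t kNetworkSize = 128;  // capacity of the (hypothetical) unrolled network

// Stand-in for the fixed-size sorting network; the real one loads whole vectors.
void SortingNetwork128(uint32_t* data) { std::sort(data, data + kNetworkSize); }

// Copy a short input into a buffer padded with the maximum key, so the
// fixed-size network always sees full vectors; the padding sorts to the end
// and is simply not copied back.
void SortShortArray(uint32_t* keys, size_t n) {
  alignas(64) uint32_t buf[kNetworkSize];
  std::copy(keys, keys + n, buf);
  std::fill(buf + n, buf + kNetworkSize, std::numeric_limits<uint32_t>::max());
  SortingNetwork128(buf);
  std::copy(buf, buf + n, keys);
}
```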
To summarize: I do believe SortingNetwork is very efficient for 128 elements. For fewer, we pay a cost to pad the input (benefit: smaller code footprint, strict adherence to x86 specs); for more, robust pivot sampling is expensive for small inputs, but worthwhile for large arrays. Does that make sense?
If some techniques are slow for inputs in some ranges and fast for inputs in different ranges, maybe they can be applied only to ranges for which they are fast.
If you can forgive the mess I've made here, I think the situation is like this:
- Region A (input sizes 2-16): vqsort is 2-4x as slow as competitors, but this really doesn't matter because when sorting larger arrays it will almost never be called on subarrays of these sizes. So region A is fine.
- Region B (about 16-128): it's doing super great! wow!
- Region C (input sizes 129-1024 or so): it is at least 2x as slow as some competitors.
- Region D (up to about 4096): it is still slower, but this might be mainly due to the slowness caused by subproblems in region C.
- Region E: it's doing well!
- Region F (the large inputs we actually care about): the graph shows no data, but it seems likely that a large part of the time is spent on subproblems in region C.
It would maybe be good to use a different approach than the current one for a narrow range of input sizes [129,K] for some K and keep the robust pivot sampling for larger input sizes.
Thanks, that summary is helpful for discussion. One thing we should clarify is that this plot is generated from a microbenchmark that is repeatedly called with the same length, which leads to very strong branch prediction. That is just about never a good idea because it doesn't reflect reality (lengths would actually be random).
> It would maybe be good to use a different approach than the current one for a narrow range of input sizes [129,K] for some K and keep the robust pivot sampling for larger input sizes.
That sounds sensible. But the extra branching and code size do have a cost (sadly often missed by microbenchmarks). I previously had an 'optimization' which did dense sampling for regions C-D, and it turned out to make things worse.
We also know from profiling data that pivot sampling accounts for only a small fraction of the time, whereas BaseCase is more like 15-20%. And it seems there is something we can do there: avoiding almost all padding, because it's OK (except at the right border) to sort more than the current sub-array (right?). This seems to be a win in region F: only 0.5-1% (near the noise floor for measurements), but consistent, so we'll gladly take it. Thank you for bringing it up.
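A rough sketch of that over-sorting idea (the helper names are hypothetical and std::sort stands in for the real network; this shows the concept only, not the actual Highway change):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

constexpr size_t kNetworkSize = 128;  // width of the (hypothetical) unrolled network

// Stand-ins: sort a full-width window, or sort a short range via copy+pad.
void SortFullWindow(uint32_t* data) { std::sort(data, data + kNetworkSize); }
void SortWithPadding(uint32_t* keys, size_t n) { std::sort(keys, keys + n); }

// Base case for the sub-array [begin, begin + n) of an array with `total` keys.
// After partitioning, keys beyond the sub-array compare >= keys inside it, so
// (except at the right border of the whole array) it is safe to sort a full
// window that extends past the sub-array, skipping the copy+pad entirely.
void BaseCase(uint32_t* array, size_t begin, size_t n, size_t total) {
  if (begin + kNetworkSize <= total) {
    SortFullWindow(array + begin);       // may sort past the sub-array's end
  } else {
    SortWithPadding(array + begin, n);   // right border: fall back to padding
  }
}
```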
But again, beware microbenchmarks. What looks like a huge win at small sizes (which are irrelevant for our applications) turns out to be tiny or possibly even negative in end-to-end benchmarks for realistic input sizes.
Super cool! It sounds plausible to me that the extra branches turn out to be too costly and that the sadness in region C was actually due in large part to padding.
I tried to reproduce the benchmark but I do not have any Skylake machines and results on Haswell seem less weird than the results djb posted.
Interesting. Can you share the numbers as a table or plot?
This is a Haswell "Intel(R) Xeon(R) CPU E5-1650 15 MiB L3" on which I've temporarily disabled turbo boost for recording the benchmark. The points labeled "vqsort_test_456048805" actually only include the first commit 94a07e8 plus whatever changes I made to get it to compile, so you should probably ignore them.
Thanks for sharing. FYI 94a07e8 is broken, we should have the fixed version up by tomorrow. Looks like your benchmark has a much lower peak after 2^7.
Yeah. I hoped to compare the huge peak in 0.17.0 to the peak after all the changes that mention this issue, but I didn't end up with a huge peak in 0.17.0.
:) FYI the padding optimization has just landed.
The graph for master looks basically the same as the gold points in the above graph on the machine I used. It does seem to sort correctly now though.
I've added vxsort to our bench_sort. Results for AVX-512 and AVX2 with 100M keys are:
[ RUN ] BenchSortGroup/BenchSort.BenchAllSort/AVX3
AVX3: vxsort: u64: uniform32: 1.00E+08 426 MB/s ( 1 threads)
AVX3: vq: u64: uniform32: 1.00E+08 570 MB/s ( 1 threads)
[ RUN ] BenchSortGroup/BenchSort.BenchAllSort/AVX2
AVX2: vxsort: u64: uniform32: 1.00E+08 247 MB/s ( 1 threads)
AVX2: vq: u64: uniform32: 1.00E+08 356 MB/s ( 1 threads)
vq is about 1.34 times as fast on AVX-512 and 1.44x on AVX2.
There appears to be a bug in vxsort's enable_avx2.h that in fact enables AVX-512 instructions on GCC. The above results are not affected by this bug; in order to compile, we had to inline the enable* header anyway.
Hi @sharpobject, FYI we have a follow-up with new measurements after fixing a performance bug, which was especially affecting short arrays. Happy to discuss if you're interested.
Closing, feel free to reopen if anyone wants to discuss.