highway
Enable usage of sorter in parallel sorting algorithms
I'd like to use the highway sort in a parallel quicksort implementation. So far I've used the highway sorter in the base case of the parallel quicksort, since only sort is exposed. This alone makes a huge difference: it's ~30% faster than tbb::parallel_sort for ~700M uint64s on a 12-thread laptop, with 60% lower CPU usage.
I'd like to also use the vectorized partitioner in contrib/sort in the earlier stages. I estimate this could save up to another ~30% wall time.
If you are interested in accepting contributions to enable this, these are the two strategies I was considering:
- Expose the partitioner in the Sorter API, or perhaps in a separate API;
- Provide the sort with a std::function<void(void())> callback for scheduling sub-stages larger than a threshold to run potentially asynchronously.
The latter is probably the simpler interface to use overall, but it may impose a (small) cost on single-threaded sorts.
Thanks @funrollloops for sharing the result, that's a nice efficiency boost.
We could certainly expose the partitioner. It should be in Sorter so it also has access to the buffer. Note that you can already use it (detail::Partition) by including vqsort-inl.h, either using only the static target (the baseline, e.g. AVX2 if compiling with -mavx2 -maes) or if you do your own dynamic dispatch.
If you do actually want to pay for dynamic dispatch (via an indirect call) per Partition, you could add a corresponding Partition128Asc counterpart to all the Sort128Asc etc functions in vqsort_*.cc, and we'd welcome such a patch.
I did a quick prototype, and the results are mixed. In my prototype I had the Partition function also invoke ChoosePivot. It turns out I also needed to pull in some memory allocation logic. Any guidance on how you would like this to look?
I wrote a benchmark to measure the performance. The two most important benchmarks here:
-------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_Sort<int64_t, HwySort>/process_time 8066 ms 8065 ms 1 bytes_per_second=507.882M/s items_per_second=66.5692M/s
BM_Sort<int64_t, HwyPartitionHwySort>/process_time 5037 ms 41470 ms 1 bytes_per_second=98.7694M/s items_per_second=12.9459M/s
This is running on an i7-12700F (8 P-cores + 4 E-cores, 20 threads) with dual-channel DDR4-3600, locked at 3GHz for all cores.
We see a ~40% speedup, but at the cost of ~5x more CPU!
In this benchmark we are sorting 512M uniformly distributed int64s; hwy::Sort is invoked ~45 times to sort ~10-16M row blocks at a time. The total time we spend in hwy::Sort is 5x the single-threaded case, so it's not due to the partition calls.
I realized that with each partition call running at ~10GB/s, I'm just running out of memory bandwidth (measured at 30GB/s on this machine). I get ~4.5s wall clock latency when the concurrency is limited to three hardware threads instead of 20, with the extra CPU time cost dropping to only 10%:
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_Sort<int64_t, HwySort>/process_time 8237 ms 8237 ms 1 bytes_per_second=497.291M/s items_per_second=65.1809M/s
BM_Sort<int64_t, HwyPartitionHwySort>/process_time 4568 ms 9483 ms 1 bytes_per_second=431.916M/s items_per_second=56.6121M/s
So in production use, the number of threads will need to be limited. There's also the risk of eating up all the memory bandwidth in a multi-tenant service.
The full set of benchmarks:
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_Sort<int64_t, StdSort>/process_time 47572 ms 47569 ms 1 bytes_per_second=86.107M/s items_per_second=11.2862M/s
BM_Sort<int64_t, TbbSort>/process_time 8866 ms 65022 ms 1 bytes_per_second=62.9937M/s items_per_second=8.2567M/s
BM_Sort<int64_t, HwySort>/process_time 8035 ms 8034 ms 1 bytes_per_second=509.842M/s items_per_second=66.826M/s
BM_Sort<int64_t, StdPartitionHwySort>/process_time 8511 ms 23619 ms 1 bytes_per_second=173.423M/s items_per_second=22.7309M/s
BM_Sort<int64_t, HwyPartitionHwySort>/process_time 4723 ms 42258 ms 1 bytes_per_second=96.9273M/s items_per_second=12.7045M/s
Indeed sounds like this system is memory bandwidth bound and that more threads are just stalled. Is that really the system your sort has to run on? 30 GB/s is quite anemic, AMD Genoa would have about 15(!) times as much.
Our production servers have more memory bandwidth, but not 10GB/s/thread. Even the Genoa system, if it has 450GB/s for 96 cores, only has ~4.7 GB/s/core. It isn't a dealbreaker, but I'll need to limit the concurrency to avoid starving other processes on the machine.
About the implementation of Partition(): I had to duplicate the buffer allocation logic from the shared Sort implementation in vqsort-inl.h, and I felt it made sense to also re-use ChoosePivot. I'm thinking of a new function in vqsort-inl.h to centralize that logic across types and sort orders, leaving the functions in the .cc files to provide only d and st, as is done for Sort.
The all-equal case is currently handled by returning a sentinel value, but it would be nicer to have a clearer API -- maybe using std::optional, with nullopt meaning all items are equal.
Does that sound reasonable?
Yes, it is an unfortunate reality that current systems are imbalanced enough to strand a large fraction of cores for many workloads (exception: branchy code, which is still quite common).
I agree it makes sense to reuse ChoosePivot. Have you seen the major update to that code relative to what's in the paper?
For Partition, note that I was considering an optimization to allow detecting when all keys in the left or right partitions are equal. This would require a different interface - I suppose we could stash it in the two MSBs, or an output parameter bitfield? (Independent of that, let's avoid nullopt because we want to stay compatible with C++11 for now.)
Reusing buffer allocation SGTM, feel free to add a function for that that Sort also calls.
Closing, feel free to reopen if you'd like to continue the discussion.