highway
Enable usage of sorter in parallel sorting algorithms
I'd like to use the highway sort in a parallel quicksort implementation. So far I've used the highway sorter in the base case of the parallel quicksort, since only sort is exposed. This alone makes a huge difference: it's ~30% faster than tbb::parallel_sort for ~700M uint64s on a 12-thread laptop, with 60% lower CPU usage.
I'd like to also use the vectorized partitioner in contrib/sort in the earlier stages. I estimate this could save up to another ~30% wall time.
If you are interested in accepting contributions to enable this, these are the two strategies I was considering:
- Expose the partitioner in the Sorter API, or perhaps in a separate API;
- Provide the sort with a std::function<void(void())> callback for scheduling sub-stages larger than a threshold to run potentially asynchronously.
The latter is probably the simpler interface to use overall, but it may impose a (small) cost on single-threaded sorts.
Thanks @funrollloops for sharing the result, that's a nice efficiency boost.
We could certainly expose the partitioner. It should be in Sorter so it also has access to the buffer. Note that you can already use it (detail::Partition) by including vqsort-inl.h, either using only the static target (the baseline, e.g. AVX2 if compiling with -mavx2 -maes) or if you do your own dynamic dispatch.
If you do actually want to pay for dynamic dispatch (via an indirect call) per Partition, you could add a corresponding Partition128Asc counterpart to all the Sort128Asc etc functions in vqsort_*.cc, and we'd welcome such a patch.
I did a quick prototype, and the results are mixed. In my prototype I had the Partition function also invoke ChoosePivot. It turns out I also needed to pull in some memory allocation logic. Any guidance on how you would like this to look?
I wrote a benchmark to measure the performance. The two most important benchmarks here:
-------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_Sort<int64_t, HwySort>/process_time 8066 ms 8065 ms 1 bytes_per_second=507.882M/s items_per_second=66.5692M/s
BM_Sort<int64_t, HwyPartitionHwySort>/process_time 5037 ms 41470 ms 1 bytes_per_second=98.7694M/s items_per_second=12.9459M/s
This is running on an i7-12700F (8 P-cores + 4 E-cores, 20 threads) with dual-channel DDR4-3600, locked at 3GHz for all cores.
We see a ~40% speedup, but at the cost of ~5x more CPU!
In this benchmark we are sorting 512M uniformly distributed int64s; hwy::Sort is invoked ~45 times to sort ~10-16M row blocks at a time. The total time we spend in hwy::Sort is 5x the single-threaded case, so it's not due to the partition calls.
I realized that with each partition call running at ~10GB/s, I'm just running out of memory bandwidth (measured at 30GB/s on this machine). I get ~4.5s wall clock latency when the concurrency is limited to three hardware threads instead of 20, with the extra CPU time cost dropping to only 10%:
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_Sort<int64_t, HwySort>/process_time 8237 ms 8237 ms 1 bytes_per_second=497.291M/s items_per_second=65.1809M/s
BM_Sort<int64_t, HwyPartitionHwySort>/process_time 4568 ms 9483 ms 1 bytes_per_second=431.916M/s items_per_second=56.6121M/s
So in production use, the number of threads will need to be limited. There's also the risk of eating up all the memory bandwidth in a multi-tenant service.
The full set of benchmarks:
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_Sort<int64_t, StdSort>/process_time 47572 ms 47569 ms 1 bytes_per_second=86.107M/s items_per_second=11.2862M/s
BM_Sort<int64_t, TbbSort>/process_time 8866 ms 65022 ms 1 bytes_per_second=62.9937M/s items_per_second=8.2567M/s
BM_Sort<int64_t, HwySort>/process_time 8035 ms 8034 ms 1 bytes_per_second=509.842M/s items_per_second=66.826M/s
BM_Sort<int64_t, StdPartitionHwySort>/process_time 8511 ms 23619 ms 1 bytes_per_second=173.423M/s items_per_second=22.7309M/s
BM_Sort<int64_t, HwyPartitionHwySort>/process_time 4723 ms 42258 ms 1 bytes_per_second=96.9273M/s items_per_second=12.7045M/s
Indeed sounds like this system is memory bandwidth bound and that more threads are just stalled. Is that really the system your sort has to run on? 30 GB/s is quite anemic, AMD Genoa would have about 15(!) times as much.
Our production servers have more memory bandwidth, but not 10GB/s/thread. Even the Genoa system, if it has 450GB/s for 96 cores, only has ~4.7 GB/s/core. It isn't a dealbreaker, but I'll need to limit the concurrency to avoid starving other processes on the machine.
About the implementation of Partition(): I had to duplicate the buffer allocation logic from the shared Sort implementation in vqsort-inl.h, and I felt it made sense to also re-use ChoosePivot. I'm thinking of a new function in vqsort-inl.h to centralize that logic across types and sort orders, leaving the functions in the .cc files to provide only d and st, as is done for Sort.
The all-equal case is currently handled by returning a sentinel value, but it would be nicer to have a clearer API -- maybe using std::optional, with nullopt meaning all items are equal.
Does that sound reasonable?
Yes, it is an unfortunate reality that current systems are imbalanced enough to strand a large fraction of cores for many workloads (exception: branchy code, which is still quite common).
I agree it makes sense to reuse ChoosePivot. Have you seen the major update to that code relative to what's in the paper?
For Partition, note that I was considering an optimization to allow detecting when all keys in the left or right partitions are equal. This would require a different interface - I suppose we could stash it in the two MSBs, or an output parameter bitfield? (Independent of that, let's avoid nullopt because we want to stay compatible with C++11 for now.)
Reusing buffer allocation SGTM, feel free to add a function for that that Sort also calls.
Closing, feel free to reopen if you'd like to continue the discussion.