x86-simd-sort
Argsort on int32_t severely underperforms
Hello, I just want to make sure I'm not doing something wrong. I stumbled upon this project after finding that IPP's radix sort had buffer size calculations that overflow well before the limit of INT32_MAX, and that all of the "Index" variants only use 32-bit types. To be clear, a 32-bit index variant would still be beneficial, and I see there are plans to eventually implement one. I just need a conditional code path for when there are too many values (I have a generated dataset with lengths in the billions). I noticed that when the code uses the AVX-512 instructions for argsort, it severely underperforms against the radix sort (which, as far as I can tell from the assembly, is completely scalar except for an iota-like scan that generates the sequential indices).
The benchmarks distributed with this project do show it winning by some margin against the scalar variants here, but not by the order of magnitude that was touted when this project was first introduced. Could this be a result of the Gather Data Sampling (GDS) mitigations in microcode? I know this code heavily uses compressed stores and gathers, and those instructions were severely slowed by the mitigation.
Here's my benchmark on my hardware:
[astylinski@fedoravm builddir]$ ./benchexe --benchmark_filter=/int32_t
2023-12-03T11:30:05-05:00
Running ./benchexe
Run on (6 X 3312 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 4096 KiB (x6)
L3 Unified 16384 KiB (x1)
Load Average: 1.02, 0.68, 0.32
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
simdargsort/random_128/int32_t 959 ns 956 ns 731536
simdargsort/random_256/int32_t 2045 ns 2040 ns 343034
simdargsort/random_512/int32_t 3866 ns 3857 ns 181511
simdargsort/random_1k/int32_t 9289 ns 9267 ns 75493
simdargsort/random_5k/int32_t 50900 ns 50768 ns 13789
simdargsort/random_100k/int32_t 1764351 ns 1757047 ns 400
simdargsort/random_1m/int32_t 29256670 ns 29111150 ns 24
simdargsort/random_10m/int32_t 959237861 ns 954095660 ns 1
simdargsort/smallrange_128/int32_t 722 ns 720 ns 975799
simdargsort/smallrange_256/int32_t 2049 ns 2044 ns 342384
simdargsort/smallrange_512/int32_t 3655 ns 3646 ns 191985
simdargsort/smallrange_1k/int32_t 9169 ns 9147 ns 76497
simdargsort/smallrange_5k/int32_t 21375 ns 21320 ns 32747
simdargsort/smallrange_100k/int32_t 467703 ns 465853 ns 1502
simdargsort/smallrange_1m/int32_t 9097255 ns 9047326 ns 75
simdargsort/smallrange_10m/int32_t 183596592 ns 182645622 ns 4
simdargsort/sorted_10k/int32_t 104574 ns 104314 ns 6668
simdargsort/constant_10k/int32_t 8648 ns 8628 ns 81134
simdargsort/reverse_10k/int32_t 105977 ns 105722 ns 6558
scalarargsort/random_128/int32_t 797 ns 795 ns 882026
scalarargsort/random_256/int32_t 1736 ns 1732 ns 403901
scalarargsort/random_512/int32_t 4000 ns 3990 ns 176210
scalarargsort/random_1k/int32_t 21879 ns 21822 ns 31937
scalarargsort/random_5k/int32_t 262249 ns 261537 ns 2676
scalarargsort/random_100k/int32_t 7551632 ns 7525002 ns 93
scalarargsort/random_1m/int32_t 103352847 ns 102895746 ns 7
scalarargsort/random_10m/int32_t 1974523438 ns 1964054017 ns 1
scalarargsort/smallrange_128/int32_t 667 ns 665 ns 1051997
scalarargsort/smallrange_256/int32_t 1498 ns 1494 ns 469065
scalarargsort/smallrange_512/int32_t 3152 ns 3144 ns 223444
scalarargsort/smallrange_1k/int32_t 13730 ns 13697 ns 50943
scalarargsort/smallrange_5k/int32_t 127606 ns 127285 ns 5499
scalarargsort/smallrange_100k/int32_t 3031465 ns 3019854 ns 232
scalarargsort/smallrange_1m/int32_t 38177852 ns 37995219 ns 18
scalarargsort/smallrange_10m/int32_t 695700305 ns 692244521 ns 1
scalarargsort/sorted_10k/int32_t 92423 ns 92202 ns 7578
scalarargsort/constant_10k/int32_t 83139 ns 82931 ns 8439
scalarargsort/reverse_10k/int32_t 74214 ns 74031 ns 9359
Hi @KungFuJesus. We haven't compared the performance of avx512_argsort with IPP's SortRadixIndexAscend yet; I will try to do that sometime next week. avx512_argsort used to leverage the gather instruction heavily, but because of the DOWNFALL security vulnerability we replaced it with a scalar emulation in https://github.com/intel/x86-simd-sort/pull/65, and I believe that has impacted its performance. Also worth noting that the IPP sort uses additional O(N) space whereas avx512_argsort is O(1) space, so I am not sure this is an apples-to-apples comparison in the first place.
Your benchmark numbers look right, and I don't believe we claimed avx512_argsort to be orders of magnitude faster than its scalar counterpart.
I'm sure this project never made that claim; I just recall Phoronix touting something like a 10x gain when NumPy used it in place of std::sort. I'm sure there's nuance there. The scalar replacement does explain why disabling the GDS mitigations didn't seem to make a difference, though. The additional memory requirement is acceptable in my case. I do wonder whether the sorting-network approach could be applied to the moves of a radix sort, e.g. instead of comparisons, the network computes the bucket positions. I'm about 90% sure IPP is using purely scalar operations for everything but the linear sequence generation.
I'm sure this project never made the claim, I just recall Phoronix touting something like a 10x gain when numpy used it in place of std::sort
I believe that was for quicksort, not argsort.
The additional memory requirement is acceptable in my case.
If additional space is acceptable, then you could try using the key-value sort keyvalue_qsort instead of argsort. Key-value sort sorts two vectors based on the values of one of them: pass a copy of your array and an array of indices initialized to [0, 1, 2, ..., N-1]. Unlike argsort, key-value sort doesn't rely on gather (or its emulation) to load data into registers and should, in theory, be faster. It does have the downside that you have to make a copy of the array, which can be expensive. Would be curious to know how it performs.
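To make the intended semantics concrete, here is a self-contained scalar sketch of argsort via sorted (key, index) pairs; with the library, x86simdsort::keyvalue_qsort on a copy of the keys plus an iota index array would play the role of the std::sort call below. The function name argsortScalar is ours for illustration, not part of the library.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Scalar sketch of argsort: pair each key with its original index,
// sort the pairs by key, then read off the permutation. A key-value
// sort achieves the same result by sorting a key array and an index
// array in lockstep.
std::vector<uint32_t> argsortScalar(const int32_t *vals, size_t n)
{
    std::vector<std::pair<int32_t, uint32_t>> kv(n);
    for (size_t i = 0; i < n; ++i)
        kv[i] = {vals[i], static_cast<uint32_t>(i)};
    std::sort(kv.begin(), kv.end(),
              [](const auto &a, const auto &b) { return a.first < b.first; });
    // idxs[j] is the original position of the j-th smallest key.
    std::vector<uint32_t> idxs(n);
    for (size_t i = 0; i < n; ++i)
        idxs[i] = kv[i].second;
    return idxs;
}
```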
I had tried that initially to get 32-bit values for the indices, and it wasn't working correctly for me for whatever reason. I didn't try it with size_t indices, though.
32-bit indices weren't supported until recently: https://github.com/intel/x86-simd-sort/pull/108. Did you try it after this was merged?
Yes, last commit I was trying against was: d9c97379049cd549d5e386849d744c31f21a2f83
And it was wrong; not wrong because the sort is unstable, but something very wonky. I'm computing a voxelization index and scanning through the voxel indices post-sort with std::upper_bound to get their groupings. I went from something like 40k unique indices down to 2 or 3. Let me reconstruct the code I used to test; it should be simple to verify correctness if there was a mistake on my end.
edit: this is what I used
std::vector<uint32_t> argSort32(int32_t *vals, size_t numVals)
{
    std::vector<uint32_t> idxs(numVals);
    std::iota(idxs.begin(), idxs.end(), 0);
    x86simdsort::keyvalue_qsort<uint32_t, int32_t>(idxs.data(), vals, numVals);
    return idxs;
}
Yeah, even with signed 32-bit types for the index I'm getting 4 unique voxel nodes instead of the 47198 there should be for this particular set of arguments.
Also, flipping the keys and argument values around results in 3 unique nodes, so I don't think that's it either.
If you can't reproduce this issue with random data, I can gladly provide the data in ASCII since it's just indices, but it will be about 123 million integers.
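For context, the grouping scan mentioned above (counting runs of equal voxel indices in the sorted array with std::upper_bound) can be sketched as follows; this is a reconstruction for illustration, not the reporter's actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Count groups of equal values in a sorted array by repeatedly jumping
// to the end of the current run with std::upper_bound. The input must
// already be sorted for this to be meaningful.
size_t countUniqueSorted(const std::vector<int32_t> &sortedVals)
{
    size_t groups = 0;
    auto it = sortedVals.begin();
    while (it != sortedVals.end()) {
        it = std::upper_bound(it, sortedVals.end(), *it); // skip run of *it
        ++groups;
    }
    return groups;
}
```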
Your argsort implementation looks incorrect. You want to sort idxs based on the values in vals: therefore vals is the key and idxs is the value.
std::vector<uint32_t> argSort32(int32_t *vals, size_t numVals)
{
    std::vector<uint32_t> idxs(numVals);
    std::iota(idxs.begin(), idxs.end(), 0);
    x86simdsort::keyvalue_qsort(vals, idxs.data(), numVals);
    return idxs;
}
A couple of things to note: numVals needs to be < UINT32_MAX. Also, I am assuming vals is a copy of your data, because the key-value sort will sort both the vals and idxs arrays.
idxs should have numVals unique values (basically [0, 1, 2, ..., numVals-1] in a jumbled-up order), else there is something seriously wrong with the code.
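The sanity check described above, that idxs comes back as a permutation of [0, numVals-1], can be sketched as a small self-contained helper (the name isPermutation is ours):

```cpp
#include <cstdint>
#include <vector>

// Check that idxs is a permutation of [0, n-1], i.e. every index in
// that range appears exactly once. A failed check would indicate the
// kind of corruption discussed above (duplicate or missing indices).
bool isPermutation(const std::vector<uint32_t> &idxs)
{
    std::vector<bool> seen(idxs.size(), false);
    for (uint32_t i : idxs) {
        if (i >= idxs.size() || seen[i])
            return false; // out of range, or a duplicate
        seen[i] = true;
    }
    return true;
}
```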
numVals == ~103 million, which is certainly smaller than the ~4.3 billion value of UINT32_MAX. Like I said before, I switched vals and idxs around in case I was confusing keys and values, and still ended up with only 3 unique values. I haven't printed even a small slice of the output yet; I'm mostly just looking at what my std::upper_bound scan is concluding.
Did you want a gzip compressed text document of the vals array?
Also, I am assuming vals is a copy of your data because the key-value sort will sort both vals and idxs arrays.
In this circumstance it's OK to be destructive to vals, as this is a temporary voxel node index calculation that doesn't survive past this point. It's merely there to get row indices so that a bunch of 3D points can be arranged spatially in a grid.
Err, wait... no, that's not strictly true. OK, so I may need to make an explicit copy. Let me try that.
Ok, yeah that was it. Thanks for the tip. Here's the timing:
radix sort time: 3825.987119, kvSort time = 4555.936716
So radix sort is still winning there by a fair margin :(. Now, I did revert to the commit just prior to your PR to see what the performance was like with actual gathers (and GDS mitigations disabled, of course), and at least in the benchmarks it looks ~1.8x faster, so it might have been beating IPP's index radix sort if we could rely on the microcode mitigations not being loaded.
Another thing going for this approach: the reason my first attempt wasn't working is that I was redoing what the key-value sort already does, namely deriving the post-sort values. So there are some savings there too, though maybe not enough to beat the index radix sort alone.
benchmarks it looks ~1.8x faster, so maybe it might have been beating IPP's index radix sort
Yeah, that is unfortunate. Hopefully future hardware won't need these mitigations and we can have a version of argsort that uses the gather instruction.
There is still scope for performance improvement in the [32-bit, 32-bit] key-value sort, which we are working on. Will keep this open till then.
@KungFuJesus PR #120 improves 32-bit key value sorting by a fair bit. Do you mind timing your benchmarks with this patch to see if it improves anything at your end?
Sure, I'll check it out today and see the improvements. The mention of zmm registers I assume means everything is still AVX-512 only; is this sort implementable with AVX2-era instructions, or is that forever stuck with a scalar implementation?
Hmm, I'm not seeing any significant improvements from this. I am using the key-value sort explicitly rather than argsort in my code; does that matter?
edit: OK, I am seeing a marginal improvement on a much, much larger set:
Before:
[INFO] VoxelGrid.cpp:93:VoxelGrid(): Voxel node sort completed in 53652.993568 ms
After:
[INFO] VoxelGrid.cpp:93:VoxelGrid(): Voxel node sort completed in 48680.894693 ms
Is this the expected gain or should we have seen something more significant?
is this sort implementation possible on AVX2 era instructions or is that forever stuck on scalar implementation?
https://github.com/intel/x86-simd-sort/pull/119 adds AVX2 versions but the performance gains won't match AVX-512.
Is this the expected gain or should we have seen something more significant?
On random data, I am seeing a 2x improvement in my benchmarking. What does your data distribution look like? Are you still benchmarking this piece of code where both the key and value vectors are 32-bit?
std::vector<uint32_t> argSort32(int32_t *vals, size_t numVals)
{
    std::vector<uint32_t> idxs(numVals);
    std::iota(idxs.begin(), idxs.end(), 0);
    x86simdsort::keyvalue_qsort(vals, idxs.data(), numVals);
    return idxs;
}
Also, what is the array size?
What does your data distribution look like?
It's 3D point cloud data that's likely to have some partial order to it. The point locations are used to compute an integer voxel node index by dividing each of the 3 dimensions by a uniform grid length. I have a variant that produces this on the GPU, where the data comes back as the rays intersect, with an atomic counter on the GPU determining the insertion index (making it slightly more random). The CPU version comes back in the order the rays are cast, but the rays are formed on a 2D grid on a plane some distance away, so there's no guarantee the data is in voxel node order at all.
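A minimal sketch of the voxel index computation described above, assuming a uniform grid; the origin, cell size, and grid dimension parameters are hypothetical, since the actual formula isn't shown in this thread:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical reconstruction of a voxel node index: quantize a 3D
// point to integer cell coordinates with a uniform cell size, then
// linearize into a single index (row-major). All parameter names here
// are assumptions for illustration, not the reporter's actual code.
int32_t voxelIndex(float x, float y, float z,
                   float originX, float originY, float originZ,
                   float cellSize, int32_t nx, int32_t ny)
{
    int32_t ix = static_cast<int32_t>(std::floor((x - originX) / cellSize));
    int32_t iy = static_cast<int32_t>(std::floor((y - originY) / cellSize));
    int32_t iz = static_cast<int32_t>(std::floor((z - originZ) / cellSize));
    return (iz * ny + iy) * nx + ix; // row-major linearization
}
```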
Are you still benchmarking this piece of code where both the key and value vectors are 32-bit?
Yes, I made sure of that; we're operating on unsigned 32-bit indices, and the values themselves (the voxel nodes) are signed. Not sure if that matters at all, though.
Also, what is the array size?
This particular example is 891416312 nodes. Though I also tried it on a significantly smaller dataset (41445847 node indices), so that it wasn't strictly a memory bandwidth benchmark, and it looked to be roughly the same time before and after (maybe 100 ms worse).
Hmm, not sure if I am measuring it wrong, but I am seeing different results. PR #121 compares copying the original array and sorting indices using keyvalue_qsort against ippsSortRadixIndexAscend_32s, and from what I am measuring, keyvalue_qsort is faster than ippsSortRadixIndexAscend_32s.
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t] -0.5941 -0.5939 2657 1078 2660 1080
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t] -0.5208 -0.5195 3260 1562 3261 1567
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t] -0.4477 -0.4476 4506 2489 4510 2491
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t] -0.3853 -0.3865 7284 4478 7304 4481
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t] -0.3024 -0.3020 32020 22338 32037 22360
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t] -0.2371 -0.2370 901383 687703 901416 687765
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t] -0.6831 -0.6831 27155120 8604753 27153895 8604067
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t] -0.9221 -0.9221 1785294341 139015600 1785197181 139002825
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t] -0.9313 -0.9313 23847222073 1637784682 23845959455 1637654538
[ippargsort.*random.*int32_t vs. simd_ordern_argsort.*random.*int32_t]_pvalue 0.3413 0.3413 U Test, Repetitions: 9 vs 18
OVERALL_GEOMEAN -0.6612 -0.6610 0 0 0 0
EDIT: these are measured for various array sizes: [128, 256, 512, 1024, 5k, 100k, 1million, 10m and 100m]
I can provide a dataset if that would help. I'm fairly confident sharing it would be acceptable, given that they are just arbitrary voxel indices.
Sure, I can take a look.
So here's a smaller example. A quick note: this does happen to be faster overall because it also applies the sort to the keys, which is why I'm using it over IPP's radix sort when on AVX-512. However, IPP's SortRadixIndexAscend had been faster than argsort by itself, which should be a fairer comparison. I've not tested the argsort function since then, but with this PR the invocation of keyvalue_qsort is not any faster.
The data itself is a bunch of little-endian 32-bit integers serialized to a file, which is then gzipped. So you should be able to use something like fread on a little-endian machine to get all of the integers (calculating the count by dividing the number of bytes by 4). exampleInput.gz
This is from voxels coming out of the Embree code, so they are less scrambled than the GPU counterpart, but not guaranteed to be in order.
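Reading the dataset described above (after decompressing the .gz, e.g. with gunzip) might look like the following sketch; it assumes a little-endian host, and the function name readVals is ours:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Read raw little-endian int32_t values from a (decompressed) file.
// The element count is the file size in bytes divided by 4. On a
// little-endian machine, fread can load the values directly.
std::vector<int32_t> readVals(const char *path)
{
    std::FILE *f = std::fopen(path, "rb");
    if (!f)
        return {};
    std::fseek(f, 0, SEEK_END);
    long bytes = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<int32_t> vals(static_cast<size_t>(bytes) / sizeof(int32_t));
    size_t got = std::fread(vals.data(), sizeof(int32_t), vals.size(), f);
    std::fclose(f);
    vals.resize(got); // keep only the elements actually read
    return vals;
}
```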
Any progress on testing this data set?
Apologies, I haven't had time yet. I will try to take a look at it this week.
Hi @KungFuJesus, yeah, it looks like for the data distribution you have, I am also measuring ippindexsort to be faster than AVX-512 argsort. I wonder if the new pivot selection might help with it. Could you try benchmarking with https://github.com/intel/x86-simd-sort/pull/134?