x86-simd-sort
x86-simd-sort copied to clipboard
Reduce overhead and optimise partial sorting algorithm
The original implementation of the partial sorting algorithm was very basic: an initial avx512_qselect
pass followed by avx512_qsort
. This works, but unfortunately it can cause unnecessary comparisons and movement of data because the second sweep (using qsort
) does not know what region (that surrounds $k$) is already sorted and what partitions had already been made during the first sweep.
This PR changes the way avx512_partial_qsort
works so that it can specialise its behaviour while recursing. Compared with main, there was an almost 6% improvement in the partial sorting related methods.
Benchmarks were performed in the cloud on a VM with 8 vCPUs of an Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz.
Main vs PR Comparison
mosullivan@sprvm-1:~/x86-simd-sort/builddir$ $COMPARE benchmarks main.json pr.json
Comparing main.json to pr.json
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------------------------------
avx512_partial_qsort<float>/10 -0.1554 -0.1548 4145 3501 4124 3486
avx512_partial_qsort<float>/100 -0.1357 -0.1376 4359 3767 4339 3741
avx512_partial_qsort<float>/1000 -0.0847 -0.0855 7363 6739 7344 6717
avx512_partial_qsort<float>/5000 +0.0324 +0.0326 20576 21243 20558 21227
avx512_partial_qsort<uint32_t>/10 -0.1730 -0.1745 3468 2868 3445 2844
avx512_partial_qsort<uint32_t>/100 -0.1932 -0.1947 3553 2867 3529 2842
avx512_partial_qsort<uint32_t>/1000 -0.1125 -0.1130 5999 5324 5976 5301
avx512_partial_qsort<uint32_t>/5000 -0.0625 -0.0623 18569 17408 18552 17396
avx512_partial_qsort<int32_t>/10 -0.1699 -0.1714 3412 2833 3391 2810
avx512_partial_qsort<int32_t>/100 -0.1993 -0.2003 3542 2836 3518 2814
avx512_partial_qsort<int32_t>/1000 -0.1192 -0.1193 5964 5253 5938 5230
avx512_partial_qsort<int32_t>/5000 -0.0700 -0.0696 18375 17089 18356 17078
avx512_partial_qsort<double>/10 -0.0267 -0.0268 6043 5882 6049 5887
avx512_partial_qsort<double>/100 -0.0167 -0.0167 6179 6076 6184 6081
avx512_partial_qsort<double>/1000 -0.0998 -0.0995 10208 9189 10211 9195
avx512_partial_qsort<double>/5000 -0.0849 -0.0848 31744 29050 31752 29058
avx512_partial_qsort<uint64_t>/10 -0.0160 -0.0158 6769 6661 6775 6667
avx512_partial_qsort<uint64_t>/100 +0.0104 +0.0106 7101 7175 7106 7182
avx512_partial_qsort<uint64_t>/1000 -0.0634 -0.0634 15914 14904 15918 14908
avx512_partial_qsort<uint64_t>/5000 -0.0400 -0.0400 42757 41046 42766 41056
avx512_partial_qsort<int64_t>/10 -0.0153 -0.0143 6774 6670 6773 6676
avx512_partial_qsort<int64_t>/100 -0.0821 -0.0820 7734 7099 7741 7107
avx512_partial_qsort<int64_t>/1000 -0.0559 -0.0558 15862 14975 15868 14982
avx512_partial_qsort<int64_t>/5000 -0.0250 -0.0251 42110 41056 42119 41063
avx512_partial_qsort<uint16_t>/10 +0.0132 +0.0136 3347 3391 3347 3393
avx512_partial_qsort<uint16_t>/100 +0.0209 +0.0217 3447 3519 3446 3520
avx512_partial_qsort<uint16_t>/1000 -0.0723 -0.0713 5767 5350 5762 5351
avx512_partial_qsort<uint16_t>/5000 -0.0061 -0.0058 16529 16428 16525 16430
avx512_partial_qsort<int16_t>/10 -0.0316 -0.0307 3425 3317 3425 3320
avx512_partial_qsort<int16_t>/100 -0.0215 -0.0220 3527 3451 3527 3449
avx512_partial_qsort<int16_t>/1000 -0.1035 -0.1033 5852 5247 5852 5248
avx512_partial_qsort<int16_t>/5000 -0.0261 -0.0261 16579 16146 16581 16148
avx512_qselect<float>/10 -0.1456 -0.1477 4108 3510 4085 3481
avx512_qselect<float>/100 -0.1388 -0.1403 4200 3618 4178 3592
avx512_qselect<float>/1000 -0.1409 -0.1422 4267 3666 4242 3638
avx512_qselect<float>/5000 +0.0052 +0.0058 4135 4157 4109 4133
avx512_qselect<uint32_t>/10 -0.1477 -0.1499 3344 2850 3323 2825
avx512_qselect<uint32_t>/100 -0.1600 -0.1618 3402 2858 3380 2833
avx512_qselect<uint32_t>/1000 -0.1605 -0.1621 3689 3097 3668 3073
avx512_qselect<uint32_t>/5000 -0.0456 -0.0441 3888 3710 3860 3690
avx512_qselect<int32_t>/10 -0.1767 -0.1779 3435 2828 3413 2806
avx512_qselect<int32_t>/100 -0.1718 -0.1733 3418 2831 3397 2808
avx512_qselect<int32_t>/1000 -0.1721 -0.1730 3693 3057 3672 3036
avx512_qselect<int32_t>/5000 -0.0339 -0.0324 3889 3757 3861 3736
avx512_qselect<double>/10 -0.0103 -0.0093 5871 5811 5873 5818
avx512_qselect<double>/100 +0.0071 +0.0074 5845 5887 5849 5892
avx512_qselect<double>/1000 -0.0016 -0.0006 5588 5579 5584 5581
avx512_qselect<double>/5000 -0.0366 -0.0369 6839 6588 6845 6592
avx512_qselect<uint64_t>/10 -0.0013 -0.0010 6728 6720 6731 6725
avx512_qselect<uint64_t>/100 +0.0038 +0.0046 6690 6716 6692 6723
avx512_qselect<uint64_t>/1000 +0.0121 +0.0141 8970 9078 8958 9084
avx512_qselect<uint64_t>/5000 +0.0025 +0.0031 8752 8774 8755 8782
avx512_qselect<int64_t>/10 -0.0010 -0.0001 6730 6723 6729 6728
avx512_qselect<int64_t>/100 -0.0023 -0.0020 6750 6735 6753 6740
avx512_qselect<int64_t>/1000 -0.0096 -0.0096 9073 8987 9077 8991
avx512_qselect<int64_t>/5000 -0.0282 -0.0286 8752 8505 8756 8506
avx512_qselect<uint16_t>/10 -0.0116 -0.0101 3304 3266 3300 3267
avx512_qselect<uint16_t>/100 -0.0105 -0.0098 3353 3318 3349 3316
avx512_qselect<uint16_t>/1000 -0.0096 -0.0080 3376 3344 3372 3345
avx512_qselect<uint16_t>/5000 -0.0022 -0.0013 3651 3643 3650 3645
avx512_qselect<int16_t>/10 -0.0155 -0.0152 3356 3304 3356 3305
avx512_qselect<int16_t>/100 -0.0146 -0.0135 3393 3344 3392 3346
avx512_qselect<int16_t>/1000 -0.0154 -0.0153 3433 3381 3431 3379
avx512_qselect<int16_t>/5000 -0.0144 -0.0138 3691 3638 3689 3638
avx512_qselect<_Float16>/10 -0.0361 -0.0360 4298 4143 4299 4144
avx512_qselect<_Float16>/100 -0.0258 -0.0253 3754 3657 3753 3658
avx512_qselect<_Float16>/1000 -0.0352 -0.0313 3557 3432 3555 3444
avx512_qselect<_Float16>/5000 -0.0258 -0.0235 3463 3374 3456 3375
avx512_partial_qsort<_Float16>/10 -0.0020 +0.0004 3575 3568 3568 3569
avx512_partial_qsort<_Float16>/100 +0.0025 +0.0030 4105 4115 4104 4117
avx512_partial_qsort<_Float16>/1000 -0.1008 -0.1006 7399 6653 7399 6655
avx512_partial_qsort<_Float16>/5000 -0.0573 -0.0572 21202 19986 21198 19987
OVERALL_GEOMEAN -0.0593 -0.0593 0 0 0 0
Originally there were additional commits that made changes that are now irrelevant. Originally, this PR made large changes to the operation of avx512_qsort
, too, but it now focusses solely on specialising avx512_partial_qsort
instead. There was a logical mistake that slipped through the test cases and had artificially improved the sorting performance.
I've identified a rather significant bug in my changes. I'll retest this with the fix applied. I'm a little concerned that the issue wasn't picked up with the built in tests (at least when I ran locally). I'll see what I can learn.
I had a problem with the right recursion in the first version of this PR that I missed because all of the tests passed without complaint. The seeds are fixed as 42 in the functions in utils/rand_array.h
(I'm guessing this is for consistency in the benchmarking) so I didn't originally pick up on the issue because k
was always 4 in the partial range test and the recursion was always into the left partition until the bitonic sort would take over. (This also led me to believe my code was a lot faster than it really was 😬).
Anyway, this is working now so happy to get some feedback.
https://github.com/intel/x86-simd-sort/blob/85f4e9c9a2bbc8766116dca4fc1755e608ab517d/utils/rand_array.h#L21
https://github.com/intel/x86-simd-sort/blob/85f4e9c9a2bbc8766116dca4fc1755e608ab517d/utils/rand_array.h#L38
Ah. Sorry, I didn't run a full test on GitHub because I never pushed to my main branch. I see Google's changed their repository a little bit so the workflow cannot run. I've proposed a simple workaround on #54 and I can rebase these once this is resolved.
thanks for your PR, allow me some time to review it. will get back to you as soon as I can :)
Your benchmark also reports unto a 14% improvement on avx512_qselect
which is a bit surprising since there should be no changes to the way quick select runs with this patch. I ran quickselect benchmarks comparison multiple times on my SKX and don't see any change to it which is what I expect.
Thanks for taking a look, @r-devulap.
I'll rebase, address your comments, and investigate the performance when I have a chance over the next week or so.
@r-devulap I apologize for the delay in revisiting this. Given the updates to the library since this PR was authored, I expect quite a few adjustments will be required during rebasing. I do intend to handle it eventually. If it's better for repository management, let me know if I should close the PR for the time being.
No worries. You can keep it open if you wish, I don't mind either way.
outdated. Closing.