x86-simd-sort Reduce overhead and optimise partial sorting algorithm

The original implementation of the partial sorting algorithm was very basic: an initial avx512_qselect pass followed by avx512_qsort. This works, but unfortunately it can cause unnecessary comparisons and movement of data because the second sweep (using qsort) does not know what region (that surrounds $k$) is already sorted and what partitions had already been made during the first sweep.

This PR changes the way avx512_partial_qsort works so that it can specialise its behaviour while recursing. Compared with main, there was an almost 6% improvement in the partial sorting related methods.

Benchmarks were performed in the cloud on a VM with 8 vCPUs of an Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz.

Main vs PR Comparison

mosullivan@sprvm-1:~/x86-simd-sort/builddir$ $COMPARE benchmarks main.json pr.json
Comparing main.json to pr.json
Benchmark                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------
avx512_partial_qsort<float>/10                     -0.1554         -0.1548          4145          3501          4124          3486
avx512_partial_qsort<float>/100                    -0.1357         -0.1376          4359          3767          4339          3741
avx512_partial_qsort<float>/1000                   -0.0847         -0.0855          7363          6739          7344          6717
avx512_partial_qsort<float>/5000                   +0.0324         +0.0326         20576         21243         20558         21227
avx512_partial_qsort<uint32_t>/10                  -0.1730         -0.1745          3468          2868          3445          2844
avx512_partial_qsort<uint32_t>/100                 -0.1932         -0.1947          3553          2867          3529          2842
avx512_partial_qsort<uint32_t>/1000                -0.1125         -0.1130          5999          5324          5976          5301
avx512_partial_qsort<uint32_t>/5000                -0.0625         -0.0623         18569         17408         18552         17396
avx512_partial_qsort<int32_t>/10                   -0.1699         -0.1714          3412          2833          3391          2810
avx512_partial_qsort<int32_t>/100                  -0.1993         -0.2003          3542          2836          3518          2814
avx512_partial_qsort<int32_t>/1000                 -0.1192         -0.1193          5964          5253          5938          5230
avx512_partial_qsort<int32_t>/5000                 -0.0700         -0.0696         18375         17089         18356         17078
avx512_partial_qsort<double>/10                    -0.0267         -0.0268          6043          5882          6049          5887
avx512_partial_qsort<double>/100                   -0.0167         -0.0167          6179          6076          6184          6081
avx512_partial_qsort<double>/1000                  -0.0998         -0.0995         10208          9189         10211          9195
avx512_partial_qsort<double>/5000                  -0.0849         -0.0848         31744         29050         31752         29058
avx512_partial_qsort<uint64_t>/10                  -0.0160         -0.0158          6769          6661          6775          6667
avx512_partial_qsort<uint64_t>/100                 +0.0104         +0.0106          7101          7175          7106          7182
avx512_partial_qsort<uint64_t>/1000                -0.0634         -0.0634         15914         14904         15918         14908
avx512_partial_qsort<uint64_t>/5000                -0.0400         -0.0400         42757         41046         42766         41056
avx512_partial_qsort<int64_t>/10                   -0.0153         -0.0143          6774          6670          6773          6676
avx512_partial_qsort<int64_t>/100                  -0.0821         -0.0820          7734          7099          7741          7107
avx512_partial_qsort<int64_t>/1000                 -0.0559         -0.0558         15862         14975         15868         14982
avx512_partial_qsort<int64_t>/5000                 -0.0250         -0.0251         42110         41056         42119         41063
avx512_partial_qsort<uint16_t>/10                  +0.0132         +0.0136          3347          3391          3347          3393
avx512_partial_qsort<uint16_t>/100                 +0.0209         +0.0217          3447          3519          3446          3520
avx512_partial_qsort<uint16_t>/1000                -0.0723         -0.0713          5767          5350          5762          5351
avx512_partial_qsort<uint16_t>/5000                -0.0061         -0.0058         16529         16428         16525         16430
avx512_partial_qsort<int16_t>/10                   -0.0316         -0.0307          3425          3317          3425          3320
avx512_partial_qsort<int16_t>/100                  -0.0215         -0.0220          3527          3451          3527          3449
avx512_partial_qsort<int16_t>/1000                 -0.1035         -0.1033          5852          5247          5852          5248
avx512_partial_qsort<int16_t>/5000                 -0.0261         -0.0261         16579         16146         16581         16148
avx512_qselect<float>/10                           -0.1456         -0.1477          4108          3510          4085          3481
avx512_qselect<float>/100                          -0.1388         -0.1403          4200          3618          4178          3592
avx512_qselect<float>/1000                         -0.1409         -0.1422          4267          3666          4242          3638
avx512_qselect<float>/5000                         +0.0052         +0.0058          4135          4157          4109          4133
avx512_qselect<uint32_t>/10                        -0.1477         -0.1499          3344          2850          3323          2825
avx512_qselect<uint32_t>/100                       -0.1600         -0.1618          3402          2858          3380          2833
avx512_qselect<uint32_t>/1000                      -0.1605         -0.1621          3689          3097          3668          3073
avx512_qselect<uint32_t>/5000                      -0.0456         -0.0441          3888          3710          3860          3690
avx512_qselect<int32_t>/10                         -0.1767         -0.1779          3435          2828          3413          2806
avx512_qselect<int32_t>/100                        -0.1718         -0.1733          3418          2831          3397          2808
avx512_qselect<int32_t>/1000                       -0.1721         -0.1730          3693          3057          3672          3036
avx512_qselect<int32_t>/5000                       -0.0339         -0.0324          3889          3757          3861          3736
avx512_qselect<double>/10                          -0.0103         -0.0093          5871          5811          5873          5818
avx512_qselect<double>/100                         +0.0071         +0.0074          5845          5887          5849          5892
avx512_qselect<double>/1000                        -0.0016         -0.0006          5588          5579          5584          5581
avx512_qselect<double>/5000                        -0.0366         -0.0369          6839          6588          6845          6592
avx512_qselect<uint64_t>/10                        -0.0013         -0.0010          6728          6720          6731          6725
avx512_qselect<uint64_t>/100                       +0.0038         +0.0046          6690          6716          6692          6723
avx512_qselect<uint64_t>/1000                      +0.0121         +0.0141          8970          9078          8958          9084
avx512_qselect<uint64_t>/5000                      +0.0025         +0.0031          8752          8774          8755          8782
avx512_qselect<int64_t>/10                         -0.0010         -0.0001          6730          6723          6729          6728
avx512_qselect<int64_t>/100                        -0.0023         -0.0020          6750          6735          6753          6740
avx512_qselect<int64_t>/1000                       -0.0096         -0.0096          9073          8987          9077          8991
avx512_qselect<int64_t>/5000                       -0.0282         -0.0286          8752          8505          8756          8506
avx512_qselect<uint16_t>/10                        -0.0116         -0.0101          3304          3266          3300          3267
avx512_qselect<uint16_t>/100                       -0.0105         -0.0098          3353          3318          3349          3316
avx512_qselect<uint16_t>/1000                      -0.0096         -0.0080          3376          3344          3372          3345
avx512_qselect<uint16_t>/5000                      -0.0022         -0.0013          3651          3643          3650          3645
avx512_qselect<int16_t>/10                         -0.0155         -0.0152          3356          3304          3356          3305
avx512_qselect<int16_t>/100                        -0.0146         -0.0135          3393          3344          3392          3346
avx512_qselect<int16_t>/1000                       -0.0154         -0.0153          3433          3381          3431          3379
avx512_qselect<int16_t>/5000                       -0.0144         -0.0138          3691          3638          3689          3638
avx512_qselect<_Float16>/10                        -0.0361         -0.0360          4298          4143          4299          4144
avx512_qselect<_Float16>/100                       -0.0258         -0.0253          3754          3657          3753          3658
avx512_qselect<_Float16>/1000                      -0.0352         -0.0313          3557          3432          3555          3444
avx512_qselect<_Float16>/5000                      -0.0258         -0.0235          3463          3374          3456          3375
avx512_partial_qsort<_Float16>/10                  -0.0020         +0.0004          3575          3568          3568          3569
avx512_partial_qsort<_Float16>/100                 +0.0025         +0.0030          4105          4115          4104          4117
avx512_partial_qsort<_Float16>/1000                -0.1008         -0.1006          7399          6653          7399          6655
avx512_partial_qsort<_Float16>/5000                -0.0573         -0.0572         21202         19986         21198         19987
OVERALL_GEOMEAN                                    -0.0593         -0.0593             0             0             0             0

Originally there were additional commits that made changes that are now irrelevant. Originally, this PR made large changes to the operation of avx512_qsort, too, but it now focusses solely on specialising avx512_partial_qsort instead. There was a logical mistake that slipped through the test cases and had artificially improved the sorting performance.

Jun 26 '23 23:06 mosullivan93

I've identified a rather significant bug in my changes. I'll retest this with the fix applied. I'm a little concerned that the issue wasn't picked up with the built in tests (at least when I ran locally). I'll see what I can learn.

Jun 28 '23 08:06 mosullivan93

I had a problem with the right recursion in the first version of this PR that I missed because all of the tests passed without complaint. The seeds are fixed as 42 in the functions in utils/rand_array.h (I'm guessing this is for consistency in the benchmarking) so I didn't originally pick up on the issue because k was always 4 in the partial range test and the recursion was always into the left partition until the bitonic sort would take over. (This also led me to believe my code was a lot faster than it really was 😬).

Anyway, this is working now so happy to get some feedback.

https://github.com/intel/x86-simd-sort/blob/85f4e9c9a2bbc8766116dca4fc1755e608ab517d/utils/rand_array.h#L21

https://github.com/intel/x86-simd-sort/blob/85f4e9c9a2bbc8766116dca4fc1755e608ab517d/utils/rand_array.h#L38

Jun 28 '23 16:06 mosullivan93

Ah. Sorry, I didn't run a full test on GitHub because I never pushed to my main branch. I see Google's changed their repository a little bit so the workflow cannot run. I've proposed a simple workaround on #54 and I can rebase these once this is resolved.

Jun 29 '23 04:06 mosullivan93

thanks for your PR, allow me some time to review it. will get back to you as soon as I can :)

Jul 04 '23 04:07 r-devulap

Your benchmark also reports unto a 14% improvement on avx512_qselect which is a bit surprising since there should be no changes to the way quick select runs with this patch. I ran quickselect benchmarks comparison multiple times on my SKX and don't see any change to it which is what I expect.

Aug 03 '23 20:08 r-devulap

Thanks for taking a look, @r-devulap.

I'll rebase, address your comments, and investigate the performance when I have a chance over the next week or so.

Aug 07 '23 07:08 mosullivan93

@r-devulap I apologize for the delay in revisiting this. Given the updates to the library since this PR was authored, I expect quite a few adjustments will be required during rebasing. I do intend to handle it eventually. If it's better for repository management, let me know if I should close the PR for the time being.

Sep 14 '23 03:09 mosullivan93

No worries. You can keep it open if you wish, I don't mind either way.

Sep 18 '23 04:09 r-devulap

outdated. Closing.

Oct 07 '24 22:10 r-devulap

x86-simd-sort x86-simd-sort copied to clipboard

Reduce overhead and optimise partial sorting algorithm

x86-simd-sort
x86-simd-sort copied to clipboard