x86-simd-sort
x86-simd-sort copied to clipboard
Cleanup for single vector sort/bitonic merge (and minor cleanup for argsort/argselect)
This patch rewrites all of the single vector sorting and bitonic merging to use swizzle ops and generic masks to reduce code duplication. It also centralizes all of this logic into one file. Also I did a small cleanup to the argsort code
This should have no impact on performance at all, and my testing seems to confirm this:
AVX512 1 million performance
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_1m vs. simdsort/random_1m]/uint64_t -0.0000 -0.0000 7758757 7758751 7758474 7758413
[simdsort/random_1m vs. simdsort/random_1m]/int64_t +0.0044 +0.0043 7753754 7787564 7753621 7786777
[simdsort/random_1m vs. simdsort/random_1m]/uint32_t +0.0031 +0.0031 3601040 3612155 3600724 3611744
[simdsort/random_1m vs. simdsort/random_1m]/int32_t -0.0007 -0.0007 3615923 3613349 3615524 3613013
[simdsort/random_1m vs. simdsort/random_1m]/uint16_t +0.0022 +0.0021 121162973 121427061 121160994 121420633
[simdsort/random_1m vs. simdsort/random_1m]/int16_t -0.0024 -0.0023 121280015 120989927 121272200 120988256
[simdsort/random_1m vs. simdsort/random_1m]/float +0.0039 +0.0038 3482392 3495850 3482070 3495381
[simdsort/random_1m vs. simdsort/random_1m]/double -0.0048 -0.0048 6564316 6533029 6564143 6532438
OVERALL_GEOMEAN +0.0007 +0.0007 0 0 0 0
AVX512 1k performance
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_1k vs. simdsort/random_1k]/uint64_t +0.0034 +0.0033 4372 4387 4377 4392
[simdsort/random_1k vs. simdsort/random_1k]/int64_t -0.0019 -0.0016 4373 4365 4375 4368
[simdsort/random_1k vs. simdsort/random_1k]/uint32_t +0.0022 +0.0010 2459 2465 2469 2471
[simdsort/random_1k vs. simdsort/random_1k]/int32_t +0.0066 +0.0075 2453 2469 2463 2482
[simdsort/random_1k vs. simdsort/random_1k]/uint16_t -0.0042 -0.0042 64846 64577 64860 64588
[simdsort/random_1k vs. simdsort/random_1k]/int16_t -0.0009 -0.0010 65761 65699 65786 65718
[simdsort/random_1k vs. simdsort/random_1k]/float +0.0099 +0.0099 2295 2318 2305 2328
[simdsort/random_1k vs. simdsort/random_1k]/double +0.0020 +0.0011 3792 3800 3800 3805
OVERALL_GEOMEAN +0.0021 +0.0020 0 0 0 0 +0.0014 +0.0015 0 0 0 0
AVX512 16 value performance
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------
[simdsort/random_16 vs. simdsort/random_16]/uint64_t -0.0004 -0.0002 859 859 862 862
[simdsort/random_16 vs. simdsort/random_16]/int64_t +0.0008 +0.0007 861 862 864 865
[simdsort/random_16 vs. simdsort/random_16]/uint32_t +0.0004 +0.0005 850 850 853 853
[simdsort/random_16 vs. simdsort/random_16]/int32_t -0.0002 -0.0002 851 851 854 854
[simdsort/random_16 vs. simdsort/random_16]/uint16_t +0.0000 -0.0000 1124 1124 1126 1126
[simdsort/random_16 vs. simdsort/random_16]/int16_t +0.0024 +0.0023 1133 1135 1135 1138
[simdsort/random_16 vs. simdsort/random_16]/float +0.0062 +0.0067 868 873 870 876
[simdsort/random_16 vs. simdsort/random_16]/double +0.0018 +0.0019 867 868 869 871
OVERALL_GEOMEAN +0.0014 +0.0015 0 0 0 0
I also simply deduplicated the logic for argsort/argselect, and cleaned up some naming there
Note that currently the test suite does not test the single vector sorting logic for qsort. The key-value versions are effectively tested by the kvsort tests, but nothing ends up testing the non key-value versions.