x86-simd-sort
x86-simd-sort copied to clipboard
Adds support for kv-select, kv-partial sort, and descending order for all key-value functions.
This patch adds support for descending kv-sort and ascending/descending kv-select and kv-partial_sort For reference, some benchmarks comparing to Pytorch's scalar implementation are provided:
With normally distributed float32:
Partial Sort (sorted topk):
size default avx2 avx512 AVX2 speedup % AVX512 speedup %
16 4.180374384 4.0826931 4.202754736 -2.336663537 0.53536717
128 5.362578869 6.11398387 5.229052782 14.01200838 -2.489960336
1024 14.10357333 14.0685565 9.240571122 -0.248283326 -34.48063899
10000 322.740033 89.83533496 48.29404964 -72.1647996 -85.03623824
100000 3757.874566 1168.889193 721.571655 -68.89493856 -80.79841032
1000000 45415.52301 11909.35397 8130.106271 -73.77690891 -82.09839779
10000000 539107.6565 217536.211 171638.298 -59.64883667 -68.16251894
100000000 6410786.629 2307972.193 1874841.69 -63.99861161 -70.75488862
Select (unsorted topk):
size default avx2 avx512 AVX2 speedup % AVX512 speedup %
16 4.046328783 4.075162649 4.120069742 0.712593258 1.822416396
128 4.912573099 5.651566267 4.852572441 15.04289408 -1.221369267
1024 9.33410556 7.444588822 6.450861102 -20.24314731 -30.88934917
10000 131.0603339 30.31828351 21.59984892 -76.8669264 -83.5191562
100000 1273.479035 461.2905271 368.7236242 -63.77714006 -71.04596038
1000000 12720.43527 4615.330229 6321.014143 -63.71719891 -50.30819302
10000000 253137.2547 130307.7698 125634.7179 -48.52287945 -50.36893401
100000000 2394076.347 1226446.152 1257464.17 -48.77163575 -47.47602048
With uniformly distributed (from 0 to 10e9) int32:
Partial Sort (sorted topk):
size default avx2 avx512 AVX2 speedup % AVX512 speedup %
16 4.159518957 4.193886042 4.062754393 0.82622738 -2.326340269
128 4.946233034 6.055155277 5.076540947 22.41953089 2.634487941
1024 10.11158738 12.40083982 8.094132173 22.63989176 -19.95191391
10000 235.6421399 81.19698569 47.26006824 -65.54224736 -79.94413552
100000 2726.637111 1325.142324 692.2358897 -51.40012146 -74.61210049
1000000 32515.39025 13960.33343 7689.116048 -57.06545939 -76.35237963
10000000 461327.1713 221366.8823 176379.2038 -52.01520828 -61.76700295
100000000 5129393.339 2428071.022 1890691.996 -52.66358297 -63.14004658
Select (unsorted topk):
size default avx2 avx512 AVX2 speedup % AVX512 speedup %
16 7.28207159 7.344153643 7.334493876 0.852532847 0.719881485
128 7.539570808 9.854883432 8.218926668 30.70881198 9.010537563
1024 14.03857871 19.15253601 10.045604 36.427885 -28.4428701
10000 146.630404 48.96809058 29.91605115 -66.60440861 -79.59764801
100000 1304.309364 585.7699348 807.1453417 -55.08964737 -38.11703236
1000000 16298.37354 7968.290179 8426.867279 -51.10990578 -48.29626859
10000000 276746.4638 175631.3324 161258.0299 -36.5370997 -41.73077129
100000000 3139316.559 1701539.516 1615918.875 -45.79904624 -48.5264119