x86-simd-sort
x86-simd-sort copied to clipboard
performance on amd 7950x ...
Hello, I tried benchmark on 7950x cpu and performance is in some tests up to 2.3x faster but in other tests much slower (like 0.3x) compared to classical sorting. Is amd implementation of avx512 not so powerful (and your code is not suitable for zen4) or is it something else ?
Thanks, Jan
I even tried to use aocc compiler since support for zen4 is limited in gcc-12 but I ended with similar results.
array type | typeid name | dtype size | array size | avx512 sort | std sort | speed up |
---|---|---|---|---|---|---|
uniform random | j | 4 | 10000 | 590233 | 1038136 | 1.8 |
uniform random | j | 4 | 100000 | 8664066 | 13623417 | 1.6 |
uniform random | j | 4 | 1000000 | 117262111 | 161208450 | 1.4 |
uniform random | i | 4 | 10000 | 602964 | 1026301 | 1.7 |
uniform random | i | 4 | 100000 | 8858911 | 13546989 | 1.5 |
uniform random | i | 4 | 1000000 | 116628655 | 163222875 | 1.4 |
uniform random | f | 4 | 10000 | 609786 | 1541317 | 2.5 |
uniform random | f | 4 | 100000 | 8699332 | 20898121 | 2.4 |
uniform random | f | 4 | 1000000 | 119333844 | 245811136 | 2.1 |
uniform random | m | 8 | 10000 | 976378 | 1045651 | 1.1 |
uniform random | m | 8 | 100000 | 14633266 | 13486549 | 0.9 |
uniform random | m | 8 | 1000000 | 191530741 | 161761275 | 0.8 |
uniform random | l | 8 | 10000 | 978412 | 1026855 | 1.0 |
uniform random | l | 8 | 100000 | 14243476 | 13692690 | 1.0 |
uniform random | l | 8 | 1000000 | 188651173 | 161367916 | 0.9 |
uniform random | d | 8 | 10000 | 969286 | 1462486 | 1.5 |
uniform random | d | 8 | 100000 | 14418094 | 20359714 | 1.4 |
uniform random | d | 8 | 1000000 | 190162746 | 237778132 | 1.3 |
uniform random | t | 2 | 10000 | 402268 | 1040013 | 2.6 |
uniform random | t | 2 | 100000 | 5960407 | 13963662 | 2.3 |
uniform random | t | 2 | 1000000 | 80146876 | 146804863 | 1.8 |
uniform random | s | 2 | 10000 | 405211 | 1014493 | 2.5 |
uniform random | s | 2 | 100000 | 5957181 | 13663327 | 2.3 |
uniform random | s | 2 | 1000000 | 80590644 | 148901683 | 1.8 |
reverse | j | 4 | 10000 | 542799 | 126346 | 0.2 |
reverse | j | 4 | 100000 | 7797145 | 1489576 | 0.2 |
reverse | j | 4 | 1000000 | 98511309 | 16375909 | 0.2 |
reverse | i | 4 | 10000 | 520938 | 119398 | 0.2 |
reverse | i | 4 | 100000 | 7450528 | 1429902 | 0.2 |
reverse | i | 4 | 1000000 | 97871607 | 16308108 | 0.2 |
reverse | f | 4 | 10000 | 542155 | 315031 | 0.6 |
reverse | f | 4 | 100000 | 7771279 | 3420900 | 0.4 |
reverse | f | 4 | 1000000 | 100643359 | 36369022 | 0.4 |
reverse | m | 8 | 10000 | 866056 | 120451 | 0.1 |
reverse | m | 8 | 100000 | 12288924 | 1437381 | 0.1 |
reverse | m | 8 | 1000000 | 162711846 | 16352523 | 0.1 |
reverse | l | 8 | 10000 | 865872 | 121270 | 0.1 |
reverse | l | 8 | 100000 | 12402765 | 1483924 | 0.1 |
reverse | l | 8 | 1000000 | 163603746 | 16913736 | 0.1 |
reverse | d | 8 | 10000 | 898384 | 130437 | 0.1 |
reverse | d | 8 | 100000 | 12923145 | 1649754 | 0.1 |
reverse | d | 8 | 1000000 | 163598584 | 18106596 | 0.1 |
reverse | t | 2 | 10000 | 364963 | 117297 | 0.3 |
reverse | t | 2 | 100000 | 5114641 | 11812023 | 2.3 |
reverse | t | 2 | 1000000 | 71573971 | 64989841 | 0.9 |
reverse | s | 2 | 10000 | 364770 | 117121 | 0.3 |
reverse | s | 2 | 100000 | 5100520 | 8060481 | 1.6 |
reverse | s | 2 | 1000000 | 71520282 | 61821589 | 0.9 |
ordered | j | 4 | 10000 | 516910 | 155038 | 0.3 |
ordered | j | 4 | 100000 | 7505586 | 1914655 | 0.3 |
ordered | j | 4 | 1000000 | 97838487 | 22667913 | 0.2 |
ordered | i | 4 | 10000 | 515178 | 154467 | 0.3 |
ordered | i | 4 | 100000 | 7497252 | 1917751 | 0.3 |
ordered | i | 4 | 1000000 | 97166173 | 22689558 | 0.2 |
ordered | f | 4 | 10000 | 534312 | 352012 | 0.7 |
ordered | f | 4 | 100000 | 7743415 | 3883455 | 0.5 |
ordered | f | 4 | 1000000 | 99754393 | 42380284 | 0.4 |
ordered | m | 8 | 10000 | 854244 | 156010 | 0.2 |
ordered | m | 8 | 100000 | 12244765 | 1916253 | 0.2 |
ordered | m | 8 | 1000000 | 160971448 | 22707598 | 0.1 |
ordered | l | 8 | 10000 | 859099 | 158184 | 0.2 |
ordered | l | 8 | 100000 | 12303855 | 1932700 | 0.2 |
ordered | l | 8 | 1000000 | 162272349 | 22793251 | 0.1 |
ordered | d | 8 | 10000 | 878319 | 163183 | 0.2 |
ordered | d | 8 | 100000 | 12501567 | 2017044 | 0.2 |
ordered | d | 8 | 1000000 | 163338570 | 23704897 | 0.1 |
ordered | t | 2 | 10000 | 359487 | 153931 | 0.4 |
ordered | t | 2 | 100000 | 5092321 | 8177886 | 1.6 |
ordered | t | 2 | 1000000 | 71484822 | 62354659 | 0.9 |
ordered | s | 2 | 10000 | 359428 | 155119 | 0.4 |
ordered | s | 2 | 100000 | 5106033 | 10950057 | 2.1 |
ordered | s | 2 | 1000000 | 71537350 | 61921390 | 0.9 |
limitedrange | j | 4 | 10000 | 358186 | 179856 | 0.5 |
limitedrange | j | 4 | 100000 | 6984247 | 4882684 | 0.7 |
limitedrange | j | 4 | 1000000 | 172545601 | 49335466 | 0.3 |
limitedrange | i | 4 | 10000 | 874782 | 188181 | 0.2 |
limitedrange | i | 4 | 100000 | 10790883 | 4756131 | 0.4 |
limitedrange | i | 4 | 1000000 | 218877264 | 52580556 | 0.2 |
limitedrange | f | 4 | 10000 | 596326 | 1542762 | 2.6 |
limitedrange | f | 4 | 100000 | 8772957 | 20883406 | 2.4 |
limitedrange | f | 4 | 1000000 | 117754398 | 245358517 | 2.1 |
limitedrange | m | 8 | 10000 | 1428448 | 182871 | 0.1 |
limitedrange | m | 8 | 100000 | 12575439 | 4792626 | 0.4 |
limitedrange | m | 8 | 1000000 | 58792036 | 50422311 | 0.9 |
limitedrange | l | 8 | 10000 | 1042609 | 184293 | 0.2 |
limitedrange | l | 8 | 100000 | 5569605 | 4849209 | 0.9 |
limitedrange | l | 8 | 1000000 | 60339919 | 50194309 | 0.8 |
limitedrange | d | 8 | 10000 | 976828 | 1465632 | 1.5 |
limitedrange | d | 8 | 100000 | 14539702 | 19981831 | 1.4 |
limitedrange | d | 8 | 1000000 | 189751693 | 238011439 | 1.3 |
limitedrange | t | 2 | 10000 | 240822 | 206838 | 0.9 |
limitedrange | t | 2 | 100000 | 8734189 | 4837590 | 0.6 |
limitedrange | t | 2 | 1000000 | 103943574 | 50969911 | 0.5 |
limitedrange | s | 2 | 10000 | 458860 | 180045 | 0.4 |
limitedrange | s | 2 | 100000 | 12315523 | 4791978 | 0.4 |
limitedrange | s | 2 | 1000000 | 103630599 | 49719604 | 0.5 |
Zen 4 based CPUs perform poorly because AMD's implementation of compressstoreu
is extremely inefficient (~142 cycles latency / throughput of 54-72 according to uops.info. It is even worse than an 'emulated' version.
I've also run the benchmark on my 7700X CPU and also get extremely poor results for Zen 4.
Thankfully, replacing calls to compressstoreu
with their emulated versions significantly improves the results.
I'm going to push the fix on a fork and will make a PR.
Thanks for the explanation. It is good that the instruction can be emulated and results can be similar to intel speedup.