GH-38560: [C++][Parquet] Rewrite BYTE_STREAM_SPLIT SSE optimizations using xsimd
Rationale for this change
This is part of https://github.com/apache/arrow/issues/38560#issuecomment-1966666606 . It rewrites the SSE4_2 implementation using xsimd.
What changes are included in this PR?
Rewrite SSE4_2 using xsimd.
Are these changes tested?
Yes
Are there any user-facing changes?
no
- GitHub Issue: #38560
:warning: GitHub issue #38560 has been automatically assigned in GitHub to PR creator.
I don't know why the R CI build failed; some help is needed...
@cyb70289 I may be asking a stupid question here:
/arrow/cpp/src/arrow/util/byte_stream_split_internal.h:161:69: error: no matching function for call to 'xsimd::batch<signed char, xsimd::neon64>::batch(xsimd::batch<int, xsimd::neon64>)'
161 | static_cast<int32_batch>(stage[2][i + 4])));
This error is raised on arm64. How should I fix it? (It compiles on my M1 macOS machine, and I expect NEON to be used there.)
I believe I hit this problem ( https://github.com/apache/arrow/pull/40335#issuecomment-1982724398 ) because of https://github.com/xtensor-stack/xsimd/issues/735
Should I disable neon64 first? Or can I upgrade xsimd first? Or can I use some other workaround? @pitrou @cyb70289
(I also found a fix, https://github.com/xtensor-stack/xsimd/commit/836b4c359edbb34e4c4448cccd9bb4fee5e34c89 , but it has not been released yet.)
And zip_lo for neon64 was merged here: https://github.com/xtensor-stack/xsimd/commit/ead07427834c82aac105d36b8671abbe915c441c
I'll disable neon64 first.
I think it's ok to upgrade xsimd.
@pitrou What do you think about the problems described here: https://github.com/apache/arrow/pull/40335#issuecomment-1983644942 ? I found that 12.1.1 contains some bugs...
@mapleFU I have no idea. Perhaps @serge-sans-paille would like to advise here.
After (on my AMD 3800X, compiled with gcc 11.4; WSL and CLion don't work well with lldb, so I'll upgrade later):
BM_ByteStreamSplitDecode_Float_Sse2/1024 268 ns 268 ns 2597941 bytes_per_second=14.2391Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/4096 1056 ns 1056 ns 659104 bytes_per_second=14.4464Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/32768 8464 ns 8464 ns 82631 bytes_per_second=14.4228Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/65536 17016 ns 17016 ns 41237 bytes_per_second=14.3476Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/1024 863 ns 863 ns 811078 bytes_per_second=8.84518Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/4096 3546 ns 3546 ns 196919 bytes_per_second=8.60728Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/32768 28309 ns 28309 ns 24734 bytes_per_second=8.62408Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/65536 56551 ns 56551 ns 12353 bytes_per_second=8.63435Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/1024 349 ns 349 ns 2002774 bytes_per_second=10.9233Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/4096 1381 ns 1381 ns 506294 bytes_per_second=11.053Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/32768 11064 ns 11064 ns 63779 bytes_per_second=11.0334Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/65536 26332 ns 26332 ns 26807 bytes_per_second=9.27175Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/1024 963 ns 963 ns 728497 bytes_per_second=7.92249Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/4096 4125 ns 4125 ns 170152 bytes_per_second=7.39747Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/32768 34597 ns 34597 ns 20206 bytes_per_second=7.05663Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/65536 69679 ns 69680 ns 10420 bytes_per_second=7.00753Gi/s
BM_ByteStreamSplitDecode_Float_Avx2/1024 230 ns 230 ns 3037165 bytes_per_second=16.5785Gi/s
BM_ByteStreamSplitDecode_Float_Avx2/4096 909 ns 909 ns 765138 bytes_per_second=16.7792Gi/s
BM_ByteStreamSplitDecode_Float_Avx2/32768 7275 ns 7275 ns 96407 bytes_per_second=16.7795Gi/s
BM_ByteStreamSplitDecode_Float_Avx2/65536 14672 ns 14672 ns 47858 bytes_per_second=16.6396Gi/s
BM_ByteStreamSplitDecode_Double_Avx2/1024 643 ns 643 ns 1086091 bytes_per_second=11.8583Gi/s
BM_ByteStreamSplitDecode_Double_Avx2/4096 2715 ns 2715 ns 257699 bytes_per_second=11.242Gi/s
BM_ByteStreamSplitDecode_Double_Avx2/32768 21646 ns 21646 ns 32293 bytes_per_second=11.2788Gi/s
BM_ByteStreamSplitDecode_Double_Avx2/65536 43594 ns 43594 ns 16003 bytes_per_second=11.2006Gi/s
BM_ByteStreamSplitEncode_Float_Avx2/1024 740 ns 740 ns 940892 bytes_per_second=5.15611Gi/s
BM_ByteStreamSplitEncode_Float_Avx2/4096 2891 ns 2891 ns 242620 bytes_per_second=5.27845Gi/s
BM_ByteStreamSplitEncode_Float_Avx2/32768 23174 ns 23174 ns 30344 bytes_per_second=5.26759Gi/s
BM_ByteStreamSplitEncode_Float_Avx2/65536 47080 ns 47080 ns 15025 bytes_per_second=5.1856Gi/s
BM_ByteStreamSplitEncode_Double_Avx2/1024 962 ns 962 ns 714181 bytes_per_second=7.92957Gi/s
BM_ByteStreamSplitEncode_Double_Avx2/4096 4206 ns 4206 ns 166235 bytes_per_second=7.25527Gi/s
BM_ByteStreamSplitEncode_Double_Avx2/32768 34696 ns 34696 ns 20041 bytes_per_second=7.03653Gi/s
BM_ByteStreamSplitEncode_Double_Avx2/65536 82677 ns 82677 ns 8268 bytes_per_second=5.90586Gi/s
Before:
BM_ByteStreamSplitDecode_Float_Sse2/1024 527 ns 527 ns 1918166 bytes_per_second=7.2438Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/4096 1789 ns 1789 ns 532823 bytes_per_second=8.52931Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/32768 11182 ns 11182 ns 77306 bytes_per_second=10.9164Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/65536 30606 ns 30605 ns 20814 bytes_per_second=7.97704Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/1024 1282 ns 1282 ns 730335 bytes_per_second=5.95065Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/4096 5093 ns 5093 ns 137810 bytes_per_second=5.99156Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/32768 42888 ns 42888 ns 13550 bytes_per_second=5.6925Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/65536 93657 ns 93649 ns 8164 bytes_per_second=5.21396Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/1024 655 ns 655 ns 1123042 bytes_per_second=5.82213Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/4096 2577 ns 2577 ns 250103 bytes_per_second=5.92139Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/32768 18899 ns 18899 ns 36646 bytes_per_second=6.45902Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/65536 40659 ns 40659 ns 20018 bytes_per_second=6.00463Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/1024 1081 ns 1078 ns 521342 bytes_per_second=7.07835Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/4096 4089 ns 4084 ns 168537 bytes_per_second=7.47223Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/32768 32269 ns 32237 ns 21543 bytes_per_second=7.57334Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/65536 65524 ns 65427 ns 10961 bytes_per_second=7.46294Gi/s
About the performance change for decode:
    for (int j = 0; j < kNumStreams; ++j) {
      _mm_storeu_si128(
          reinterpret_cast<__m128i*>(out + (i * kNumStreams + j) * sizeof(__m128i)),
          stage[kNumStreamsLog2][j]);
    }
Changing this store so that it no longer casts to __m128i improves the performance.
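For readers who have not seen the encoding, here is a minimal scalar sketch of what the SIMD kernels compute. It is not the code from this PR or from Arrow's `byte_stream_split_internal.h`; the function names and the `std::vector` interface are chosen only for illustration. Byte `k` of value `i` is written to stream `k` at position `i`, and decoding is the inverse.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference sketch (illustrative, not Arrow's implementation).
// `raw` holds num_values values of `width` bytes each, back to back.
std::vector<uint8_t> ByteStreamSplitEncode(const std::vector<uint8_t>& raw,
                                           std::size_t width) {
  const std::size_t num_values = raw.size() / width;
  std::vector<uint8_t> out(num_values * width);
  for (std::size_t i = 0; i < num_values; ++i) {
    for (std::size_t k = 0; k < width; ++k) {
      // Byte k of value i goes to stream k, slot i.
      out[k * num_values + i] = raw[i * width + k];
    }
  }
  return out;
}

// Inverse of the encode step: gather byte k of value i from stream k.
std::vector<uint8_t> ByteStreamSplitDecode(const std::vector<uint8_t>& encoded,
                                           std::size_t width) {
  const std::size_t num_values = encoded.size() / width;
  std::vector<uint8_t> out(num_values * width);
  for (std::size_t i = 0; i < num_values; ++i) {
    for (std::size_t k = 0; k < width; ++k) {
      out[i * width + k] = encoded[k * num_values + i];
    }
  }
  return out;
}
```

The vectorized versions do the same byte transposition, but on whole 16-byte (or wider) registers, which is why the final store in the decode loop matters for throughput.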
MacOS M1 Pro, compiler using LLVM-17
BM_ByteStreamSplitDecode_Float_Neon/1024 393 ns 393 ns 1781393 bytes_per_second=9.7103G/s
BM_ByteStreamSplitDecode_Float_Neon/4096 1523 ns 1522 ns 459550 bytes_per_second=10.0244G/s
BM_ByteStreamSplitDecode_Float_Neon/32768 13254 ns 13251 ns 52771 bytes_per_second=9.21235G/s
BM_ByteStreamSplitDecode_Float_Neon/65536 26862 ns 26856 ns 26041 bytes_per_second=9.09063G/s
BM_ByteStreamSplitDecode_Double_Neon/1024 1311 ns 1311 ns 534772 bytes_per_second=5.82162G/s
BM_ByteStreamSplitDecode_Double_Neon/4096 5166 ns 5165 ns 135459 bytes_per_second=5.90808G/s
BM_ByteStreamSplitDecode_Double_Neon/32768 46743 ns 46707 ns 14991 bytes_per_second=5.22712G/s
BM_ByteStreamSplitDecode_Double_Neon/65536 92789 ns 92769 ns 7546 bytes_per_second=5.26339G/s
BM_ByteStreamSplitEncode_Float_Neon/1024 565 ns 564 ns 1239926 bytes_per_second=6.7585G/s
BM_ByteStreamSplitEncode_Float_Neon/4096 2207 ns 2206 ns 317266 bytes_per_second=6.91565G/s
BM_ByteStreamSplitEncode_Float_Neon/32768 18854 ns 18847 ns 37160 bytes_per_second=6.47679G/s
BM_ByteStreamSplitEncode_Float_Neon/65536 37583 ns 37568 ns 18597 bytes_per_second=6.49871G/s
BM_ByteStreamSplitEncode_Double_Neon/1024 924 ns 924 ns 758602 bytes_per_second=8.25749G/s
BM_ByteStreamSplitEncode_Double_Neon/4096 3645 ns 3644 ns 192023 bytes_per_second=8.37507G/s
BM_ByteStreamSplitEncode_Double_Neon/32768 33733 ns 33721 ns 20762 bytes_per_second=7.23999G/s
BM_ByteStreamSplitEncode_Double_Neon/65536 69052 ns 69030 ns 10090 bytes_per_second=7.07349G/s
BM_ByteStreamSplitDecode_Float_Scalar/1024 782 ns 782 ns 895232 bytes_per_second=4.88038G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096 3115 ns 3113 ns 224499 bytes_per_second=4.90148G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768 24910 ns 24904 ns 28080 bytes_per_second=4.90165G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536 49568 ns 49555 ns 14065 bytes_per_second=4.92667G/s
BM_ByteStreamSplitDecode_Double_Scalar/1024 1567 ns 1567 ns 447342 bytes_per_second=4.86952G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096 6242 ns 6239 ns 112187 bytes_per_second=4.89108G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768 51709 ns 51700 ns 13575 bytes_per_second=4.72229G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536 103764 ns 103734 ns 6749 bytes_per_second=4.70704G/s
BM_ByteStreamSplitEncode_Float_Scalar/1024 1052 ns 1052 ns 664483 bytes_per_second=3.62624G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096 4190 ns 4189 ns 167091 bytes_per_second=3.64299G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768 33639 ns 33628 ns 20806 bytes_per_second=3.63006G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536 67292 ns 67274 ns 10410 bytes_per_second=3.62906G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024 2107 ns 2106 ns 334135 bytes_per_second=3.62257G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096 8359 ns 8357 ns 83645 bytes_per_second=3.65182G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768 67293 ns 67271 ns 10369 bytes_per_second=3.62921G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536 134826 ns 134783 ns 5199 bytes_per_second=3.62273G/s
I don't have all the context, but upgrading xsimd looks reasonable if it fixes your issues. Would you need a new release?
MacOS M1 Pro, compiler using LLVM-17
Can you also post the _Scalar numbers for comparison?
Done
My AMD 3800X Scalar code benchmark:
BM_ByteStreamSplitDecode_Float_Scalar/1024 1321 ns 1321 ns 655835 bytes_per_second=2.88745Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/4096 4252 ns 4252 ns 163571 bytes_per_second=3.58879Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/32768 40957 ns 40957 ns 16015 bytes_per_second=2.98046Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/65536 92735 ns 92734 ns 8404 bytes_per_second=2.6327Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/1024 3991 ns 3991 ns 185298 bytes_per_second=1.9117Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/4096 11446 ns 11446 ns 59298 bytes_per_second=2.66621Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/32768 85616 ns 85615 ns 7896 bytes_per_second=2.8516Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/65536 147690 ns 147689 ns 4269 bytes_per_second=3.30614Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/1024 889 ns 889 ns 856381 bytes_per_second=4.29155Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/4096 3292 ns 3292 ns 203271 bytes_per_second=4.6351Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/32768 26056 ns 26055 ns 27125 bytes_per_second=4.68507Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/65536 52304 ns 52304 ns 13491 bytes_per_second=4.66769Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/1024 2216 ns 2216 ns 354656 bytes_per_second=3.44257Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/4096 7198 ns 7198 ns 91613 bytes_per_second=4.24001Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/32768 61507 ns 61507 ns 13072 bytes_per_second=3.96931Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/65536 125982 ns 125981 ns 5013 bytes_per_second=3.87584Gi/s
What about the macOS M1 Pro ?
I've updated the results here: https://github.com/apache/arrow/pull/40335#issuecomment-1984131068
Basically, it's about 2 times faster.
Very nice, thank you! @cyb70289 Do you have the possibility to run on other ARM CPUs?
sure, will do
@github-actions crossbow submit -g cpp
Revision: bd415ef7364994532a9ec807e387e4b1f3aee7ff
Submitted crossbow builds: ursacomputing/crossbow @ actions-5e50b95523
@github-actions crossbow submit -g wheel
Revision: dea27751eb6df9c4cda76ddebefb32eecb762539
Submitted crossbow builds: ursacomputing/crossbow @ actions-4e1aef9966
Hmm, could you please rebase to get some CI fixes?
@github-actions crossbow submit -g wheel
Revision: 4aa6fdd69eff9b8927d55e0908e3cec5c9c23cd4
Submitted crossbow builds: ursacomputing/crossbow @ actions-761d7cbf39
Hi @serge-sans-paille . I found some neon64-related issues here. For the issue I hit in this patch (casting between types in a register), I'm now using memcpy as a workaround. Could we move to a new release once one containing these bug fixes is published?
Thanks for xsimd. This is my first time doing SIMD programming, and xsimd makes it convenient :-)
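The thread does not show the actual workaround, so the following is only a sketch of the general idea of reinterpreting register contents through memcpy instead of a type cast. Plain `std::array` types stand in for the xsimd batches, and `BitCastViaMemcpy`, `Bytes16` and `Ints4` are hypothetical names introduced here for illustration.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Stand-ins for a 128-bit register viewed as 16 x int8 or 4 x int32.
// (Assumption: the real patch works on xsimd batches, which are not shown here.)
using Bytes16 = std::array<int8_t, 16>;
using Ints4 = std::array<int32_t, 4>;

// Reinterpret the bits of `src` as type To by copying the bytes.
// Compilers typically lower a fixed-size memcpy like this to a register move.
template <typename To, typename From>
To BitCastViaMemcpy(const From& src) {
  static_assert(sizeof(To) == sizeof(From), "types must have the same size");
  static_assert(std::is_trivially_copyable<To>::value &&
                    std::is_trivially_copyable<From>::value,
                "types must be trivially copyable");
  To dst;
  std::memcpy(&dst, &src, sizeof(To));
  return dst;
}
```

A round trip such as `BitCastViaMemcpy<Bytes16>(BitCastViaMemcpy<Ints4>(bytes))` returns the original bytes, since no value conversion takes place, only a reinterpretation of the same 16 bytes.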
Could you open a separate bug in the xsimd bug tracker with a reproducer?
I'm not reporting a new bug. I mean that, per https://github.com/apache/arrow/pull/40335#issuecomment-1983644942 , some bug fixes and neon64-related enhancements are not included in the latest release, 12.1.1.
Would you need a new release?
Maybe a new release would be better
Tested on Neoverse-N1. With clang, I see a performance improvement in both the encoder and the decoder. But with gcc, there's some drop in the encoder.
- clang-16, improvement from both encoder and decoder
decode (improve)
----------------
BM_ByteStreamSplitDecode_Float_Scalar/1024 1167 ns 1167 ns 600395 bytes_per_second=3.27015Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/4096 4648 ns 4648 ns 150615 bytes_per_second=3.28313Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/32768 38248 ns 38247 ns 18300 bytes_per_second=3.19159Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/65536 76448 ns 76446 ns 9159 bytes_per_second=3.19363Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/1024 2814 ns 2814 ns 248735 bytes_per_second=2.71086Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/4096 11236 ns 11236 ns 62307 bytes_per_second=2.7161Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/32768 92623 ns 92616 ns 7551 bytes_per_second=2.63604Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/65536 188190 ns 188185 ns 3728 bytes_per_second=2.59469Gi/s
BM_ByteStreamSplitDecode_Float_Neon/1024 817 ns 817 ns 856316 bytes_per_second=4.66674Gi/s
BM_ByteStreamSplitDecode_Float_Neon/4096 3240 ns 3240 ns 216075 bytes_per_second=4.71005Gi/s
BM_ByteStreamSplitDecode_Float_Neon/32768 26981 ns 26981 ns 25942 bytes_per_second=4.52429Gi/s
BM_ByteStreamSplitDecode_Float_Neon/65536 54189 ns 54186 ns 12924 bytes_per_second=4.50564Gi/s
BM_ByteStreamSplitDecode_Double_Neon/1024 1767 ns 1767 ns 396110 bytes_per_second=4.31715Gi/s
BM_ByteStreamSplitDecode_Double_Neon/4096 7138 ns 7137 ns 98106 bytes_per_second=4.27568Gi/s
BM_ByteStreamSplitDecode_Double_Neon/32768 64999 ns 64997 ns 10779 bytes_per_second=3.75616Gi/s
BM_ByteStreamSplitDecode_Double_Neon/65536 130243 ns 130243 ns 5366 bytes_per_second=3.74901Gi/s
encode (improve)
----------------
BM_ByteStreamSplitEncode_Float_Scalar/1024 1482 ns 1482 ns 472507 bytes_per_second=2.57419Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/4096 5897 ns 5897 ns 118700 bytes_per_second=2.58776Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/32768 47959 ns 47956 ns 14597 bytes_per_second=2.54548Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/65536 95903 ns 95896 ns 7298 bytes_per_second=2.54588Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/1024 2950 ns 2950 ns 237274 bytes_per_second=2.58627Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/4096 11786 ns 11786 ns 59393 bytes_per_second=2.58938Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/32768 98141 ns 98138 ns 7133 bytes_per_second=2.48773Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/65536 198219 ns 198203 ns 3531 bytes_per_second=2.46354Gi/s
BM_ByteStreamSplitEncode_Float_Neon/1024 1152 ns 1152 ns 607844 bytes_per_second=3.31275Gi/s
BM_ByteStreamSplitEncode_Float_Neon/4096 4571 ns 4570 ns 153146 bytes_per_second=3.33858Gi/s
BM_ByteStreamSplitEncode_Float_Neon/32768 37086 ns 37084 ns 18873 bytes_per_second=3.29172Gi/s
BM_ByteStreamSplitEncode_Float_Neon/65536 74336 ns 74336 ns 9417 bytes_per_second=3.2843Gi/s
BM_ByteStreamSplitEncode_Double_Neon/1024 1978 ns 1978 ns 353156 bytes_per_second=3.85706Gi/s
BM_ByteStreamSplitEncode_Double_Neon/4096 7947 ns 7947 ns 87879 bytes_per_second=3.84032Gi/s
BM_ByteStreamSplitEncode_Double_Neon/32768 64458 ns 64458 ns 10863 bytes_per_second=3.7876Gi/s
BM_ByteStreamSplitEncode_Double_Neon/65536 128693 ns 128689 ns 5440 bytes_per_second=3.79428Gi/s
- gcc-13, decoder improves, but encoder drops
decode (improve)
----------------
BM_ByteStreamSplitDecode_Float_Scalar/1024 1133 ns 1133 ns 617695 bytes_per_second=3.3663Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/4096 4484 ns 4484 ns 156105 bytes_per_second=3.40284Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/32768 36318 ns 36318 ns 19273 bytes_per_second=3.36116Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/65536 73048 ns 73047 ns 9554 bytes_per_second=3.34225Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/1024 2814 ns 2814 ns 248738 bytes_per_second=2.7114Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/4096 11227 ns 11226 ns 62355 bytes_per_second=2.71838Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/32768 92482 ns 92478 ns 7552 bytes_per_second=2.64Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/65536 185853 ns 185844 ns 3748 bytes_per_second=2.62737Gi/s
BM_ByteStreamSplitDecode_Float_Neon/1024 775 ns 775 ns 903307 bytes_per_second=4.92282Gi/s
BM_ByteStreamSplitDecode_Float_Neon/4096 3061 ns 3061 ns 228720 bytes_per_second=4.98565Gi/s
BM_ByteStreamSplitDecode_Float_Neon/32768 25543 ns 25542 ns 27405 bytes_per_second=4.77925Gi/s
BM_ByteStreamSplitDecode_Float_Neon/65536 51478 ns 51474 ns 13609 bytes_per_second=4.74294Gi/s
BM_ByteStreamSplitDecode_Double_Neon/1024 1626 ns 1626 ns 429095 bytes_per_second=4.69278Gi/s
BM_ByteStreamSplitDecode_Double_Neon/4096 6485 ns 6485 ns 107513 bytes_per_second=4.70567Gi/s
BM_ByteStreamSplitDecode_Double_Neon/32768 59680 ns 59680 ns 11757 bytes_per_second=4.09083Gi/s
BM_ByteStreamSplitDecode_Double_Neon/65536 120697 ns 120688 ns 5594 bytes_per_second=4.04582Gi/s
encode (drop)
-------------
BM_ByteStreamSplitEncode_Float_Scalar/1024 1142 ns 1142 ns 613228 bytes_per_second=3.34041Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/4096 4511 ns 4511 ns 155178 bytes_per_second=3.3825Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/32768 37560 ns 37560 ns 18636 bytes_per_second=3.25003Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/65536 75348 ns 75343 ns 9301 bytes_per_second=3.2404Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/1024 2201 ns 2201 ns 318028 bytes_per_second=3.46606Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/4096 8795 ns 8795 ns 79615 bytes_per_second=3.46994Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/32768 77388 ns 77383 ns 9045 bytes_per_second=3.15497Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/65536 153900 ns 153900 ns 4543 bytes_per_second=3.17272Gi/s
BM_ByteStreamSplitEncode_Float_Neon/1024 1238 ns 1238 ns 565551 bytes_per_second=3.08201Gi/s
BM_ByteStreamSplitEncode_Float_Neon/4096 4894 ns 4893 ns 143073 bytes_per_second=3.11821Gi/s
BM_ByteStreamSplitEncode_Float_Neon/32768 39594 ns 39594 ns 17679 bytes_per_second=3.08304Gi/s
BM_ByteStreamSplitEncode_Float_Neon/65536 79201 ns 79200 ns 8838 bytes_per_second=3.0826Gi/s
BM_ByteStreamSplitEncode_Double_Neon/1024 2573 ns 2573 ns 272609 bytes_per_second=2.96532Gi/s
BM_ByteStreamSplitEncode_Double_Neon/4096 10249 ns 10248 ns 68149 bytes_per_second=2.97782Gi/s
BM_ByteStreamSplitEncode_Double_Neon/32768 88791 ns 88791 ns 7884 bytes_per_second=2.7496Gi/s
BM_ByteStreamSplitEncode_Double_Neon/65536 176888 ns 176888 ns 3958 bytes_per_second=2.7604Gi/s
Did some quick profiling. With gcc, it looks like the Neon code doesn't reduce the total instruction count: normalizing instructions by Iterations gives a similar result for the scalar and Neon benchmarks.
ASE_SPEC in the output is the total number of Neon (ASIMD) instructions.
- profile scalar encode
perf stat -e ASE_SPEC,instructions,cycles -- release/parquet-encoding-benchmark --benchmark_filter=BM_ByteStreamSplitEncode_Double_Scalar/65536
-------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitEncode_Double_Scalar/65536 152798 ns 152798 ns 4586 bytes_per_second=3.19561Gi/s
37,622 ASE_SPEC
9,189,030,863 instructions # 3.48 insn per cycle
2,643,007,625 cycles
- profile neon encode
perf stat -e ASE_SPEC,instructions,cycles -- release/parquet-encoding-benchmark --benchmark_filter=BM_ByteStreamSplitEncode_Double_Neon/65536
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BM_ByteStreamSplitEncode_Double_Neon/65536 177150 ns 177149 ns 3948 bytes_per_second=2.75633Gi/s
2,985,430,024 ASE_SPEC
7,998,814,771 instructions # 2.94 insn per cycle
2,718,202,634 cycles
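The normalization mentioned above can be reproduced from the two perf outputs with a small helper. This is only a rough sketch: `PerIteration` is a hypothetical name, and the result is an approximation because `perf stat` counts the whole process (setup and any iterations the benchmark runs before the reported ones), not just the reported Iterations.

```cpp
#include <cstdint>

// Divide a whole-process perf counter by the Iterations column of the
// benchmark output. Approximate: the counter also includes work done
// outside the reported iterations.
double PerIteration(std::uint64_t counter_total, std::uint64_t iterations) {
  return static_cast<double>(counter_total) / static_cast<double>(iterations);
}
```

With the numbers posted above, `PerIteration(9189030863ULL, 4586)` for the scalar run and `PerIteration(7998814771ULL, 3948)` for the Neon run both come out at roughly 2.0 million instructions per iteration, with Neon about 1% higher. Combined with the lower rate reported by perf (2.94 versus 3.48 insn per cycle), this matches the observed encoder slowdown with gcc.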