arrow GH-38560: [C++][Parquet] Rewrite BYTE_STREAM_SPLIT SSE optimizations using xsimd

Rationale for this change

This is part of https://github.com/apache/arrow/issues/38560#issuecomment-1966666606 . It rewrites the SSE4_2 implementation using xsimd.

What changes are included in this PR?

Rewrite SSE4_2 using xsimd.

Are these changes tested?

Yes

Are there any user-facing changes?

no

GitHub Issue: #38560

Mar 04 '24 11:03 mapleFU

:warning: GitHub issue #38560 has been automatically assigned in GitHub to PR creator.

Mar 04 '24 11:03 github-actions[bot]

I don't know why the R CI build failed; some help is needed...

Mar 07 '24 07:03 mapleFU

@cyb70289 I may be asking a stupid question here:

/arrow/cpp/src/arrow/util/byte_stream_split_internal.h:161:69: error: no matching function for call to 'xsimd::batch<signed char, xsimd::neon64>::batch(xsimd::batch<int, xsimd::neon64>)'
  161 |                           static_cast<int32_batch>(stage[2][i + 4])));

This error is raised on arm64. How should I fix it? (It compiles on my M1 macOS machine, and I expect neon to be used there.)

Mar 07 '24 07:03 mapleFU

I believe I hit this problem: https://github.com/apache/arrow/pull/40335#issuecomment-1982724398 because of: https://github.com/xtensor-stack/xsimd/issues/735

Should I disable neon64 first? Or can I upgrade xsimd first? Or should I use some other workaround? @pitrou @cyb70289

Mar 07 '24 14:03 mapleFU

(I also found a fix: https://github.com/xtensor-stack/xsimd/commit/836b4c359edbb34e4c4448cccd9bb4fee5e34c89 , but it has not been released yet.)

And zip_lo for neon64 was merged here: https://github.com/xtensor-stack/xsimd/commit/ead07427834c82aac105d36b8671abbe915c441c

I'll disable neon64 first.

Mar 07 '24 14:03 mapleFU

I think it's ok to upgrade xsimd.

Mar 07 '24 15:03 pitrou

@pitrou What do you think of the problems here: https://github.com/apache/arrow/pull/40335#issuecomment-1983644942 ? I found that 12.1.1 contains some bugs...

Mar 07 '24 15:03 mapleFU

@mapleFU I have no idea. Perhaps @serge-sans-paille would like to advise here.

Mar 07 '24 15:03 pitrou

After (on my AMD 3800X, compiled with gcc 11.4; WSL and CLion don't work well with lldb, I'll upgrade it later):

BM_ByteStreamSplitDecode_Float_Sse2/1024            268 ns          268 ns      2597941 bytes_per_second=14.2391Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/4096           1056 ns         1056 ns       659104 bytes_per_second=14.4464Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/32768          8464 ns         8464 ns        82631 bytes_per_second=14.4228Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/65536         17016 ns        17016 ns        41237 bytes_per_second=14.3476Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/1024           863 ns          863 ns       811078 bytes_per_second=8.84518Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/4096          3546 ns         3546 ns       196919 bytes_per_second=8.60728Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/32768        28309 ns        28309 ns        24734 bytes_per_second=8.62408Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/65536        56551 ns        56551 ns        12353 bytes_per_second=8.63435Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/1024            349 ns          349 ns      2002774 bytes_per_second=10.9233Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/4096           1381 ns         1381 ns       506294 bytes_per_second=11.053Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/32768         11064 ns        11064 ns        63779 bytes_per_second=11.0334Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/65536         26332 ns        26332 ns        26807 bytes_per_second=9.27175Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/1024           963 ns          963 ns       728497 bytes_per_second=7.92249Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/4096          4125 ns         4125 ns       170152 bytes_per_second=7.39747Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/32768        34597 ns        34597 ns        20206 bytes_per_second=7.05663Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/65536        69679 ns        69680 ns        10420 bytes_per_second=7.00753Gi/s
BM_ByteStreamSplitDecode_Float_Avx2/1024            230 ns          230 ns      3037165 bytes_per_second=16.5785Gi/s
BM_ByteStreamSplitDecode_Float_Avx2/4096            909 ns          909 ns       765138 bytes_per_second=16.7792Gi/s
BM_ByteStreamSplitDecode_Float_Avx2/32768          7275 ns         7275 ns        96407 bytes_per_second=16.7795Gi/s
BM_ByteStreamSplitDecode_Float_Avx2/65536         14672 ns        14672 ns        47858 bytes_per_second=16.6396Gi/s
BM_ByteStreamSplitDecode_Double_Avx2/1024           643 ns          643 ns      1086091 bytes_per_second=11.8583Gi/s
BM_ByteStreamSplitDecode_Double_Avx2/4096          2715 ns         2715 ns       257699 bytes_per_second=11.242Gi/s
BM_ByteStreamSplitDecode_Double_Avx2/32768        21646 ns        21646 ns        32293 bytes_per_second=11.2788Gi/s
BM_ByteStreamSplitDecode_Double_Avx2/65536        43594 ns        43594 ns        16003 bytes_per_second=11.2006Gi/s
BM_ByteStreamSplitEncode_Float_Avx2/1024            740 ns          740 ns       940892 bytes_per_second=5.15611Gi/s
BM_ByteStreamSplitEncode_Float_Avx2/4096           2891 ns         2891 ns       242620 bytes_per_second=5.27845Gi/s
BM_ByteStreamSplitEncode_Float_Avx2/32768         23174 ns        23174 ns        30344 bytes_per_second=5.26759Gi/s
BM_ByteStreamSplitEncode_Float_Avx2/65536         47080 ns        47080 ns        15025 bytes_per_second=5.1856Gi/s
BM_ByteStreamSplitEncode_Double_Avx2/1024           962 ns          962 ns       714181 bytes_per_second=7.92957Gi/s
BM_ByteStreamSplitEncode_Double_Avx2/4096          4206 ns         4206 ns       166235 bytes_per_second=7.25527Gi/s
BM_ByteStreamSplitEncode_Double_Avx2/32768        34696 ns        34696 ns        20041 bytes_per_second=7.03653Gi/s
BM_ByteStreamSplitEncode_Double_Avx2/65536        82677 ns        82677 ns         8268 bytes_per_second=5.90586Gi/s

Before:

BM_ByteStreamSplitDecode_Float_Sse2/1024            527 ns          527 ns      1918166 bytes_per_second=7.2438Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/4096           1789 ns         1789 ns       532823 bytes_per_second=8.52931Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/32768         11182 ns        11182 ns        77306 bytes_per_second=10.9164Gi/s
BM_ByteStreamSplitDecode_Float_Sse2/65536         30606 ns        30605 ns        20814 bytes_per_second=7.97704Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/1024          1282 ns         1282 ns       730335 bytes_per_second=5.95065Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/4096          5093 ns         5093 ns       137810 bytes_per_second=5.99156Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/32768        42888 ns        42888 ns        13550 bytes_per_second=5.6925Gi/s
BM_ByteStreamSplitDecode_Double_Sse2/65536        93657 ns        93649 ns         8164 bytes_per_second=5.21396Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/1024            655 ns          655 ns      1123042 bytes_per_second=5.82213Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/4096           2577 ns         2577 ns       250103 bytes_per_second=5.92139Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/32768         18899 ns        18899 ns        36646 bytes_per_second=6.45902Gi/s
BM_ByteStreamSplitEncode_Float_Sse2/65536         40659 ns        40659 ns        20018 bytes_per_second=6.00463Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/1024          1081 ns         1078 ns       521342 bytes_per_second=7.07835Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/4096          4089 ns         4084 ns       168537 bytes_per_second=7.47223Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/32768        32269 ns        32237 ns        21543 bytes_per_second=7.57334Gi/s
BM_ByteStreamSplitEncode_Double_Sse2/65536        65524 ns        65427 ns        10961 bytes_per_second=7.46294Gi/s

About the performance change for decode:

    for (int j = 0; j < kNumStreams; ++j) {
      _mm_storeu_si128(
          reinterpret_cast<__m128i*>(out + (i * kNumStreams + j) * sizeof(__m128i)),
          stage[kNumStreamsLog2][j]);
    }

Changing this store so that it no longer casts to __m128i improves the performance.
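For readers without the codec in mind, here is a minimal scalar sketch of what a BYTE_STREAM_SPLIT encode and decode compute: byte j of every value is gathered into stream j, and decoding interleaves the streams back. The function names are hypothetical and are not taken from the Arrow sources; the SIMD loop above is a vectorized form of the decode direction, and this sketch makes no claim about how the PR implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper names, for illustration only (not Arrow's API).

// Encode: byte j of value i is written to position j * num_values + i,
// so the output holds sizeof(T) consecutive byte streams.
template <typename T>
std::vector<uint8_t> ByteStreamSplitEncodeSketch(const std::vector<T>& values) {
  constexpr std::size_t kNumStreams = sizeof(T);
  const std::size_t num_values = values.size();
  std::vector<uint8_t> out(num_values * kNumStreams);
  for (std::size_t i = 0; i < num_values; ++i) {
    uint8_t raw[kNumStreams];
    std::memcpy(raw, &values[i], kNumStreams);  // bytes in memory order
    for (std::size_t j = 0; j < kNumStreams; ++j) {
      out[j * num_values + i] = raw[j];
    }
  }
  return out;
}

// Decode: gather byte i of each stream and reassemble value i.
template <typename T>
std::vector<T> ByteStreamSplitDecodeSketch(const std::vector<uint8_t>& data) {
  constexpr std::size_t kNumStreams = sizeof(T);
  assert(data.size() % kNumStreams == 0);
  const std::size_t num_values = data.size() / kNumStreams;
  std::vector<T> out(num_values);
  for (std::size_t i = 0; i < num_values; ++i) {
    uint8_t raw[kNumStreams];
    for (std::size_t j = 0; j < kNumStreams; ++j) {
      raw[j] = data[j * num_values + i];
    }
    std::memcpy(&out[i], raw, kNumStreams);
  }
  return out;
}
```

Decoding the output of the encoder with the same value type should return the original values, which is a convenient check when comparing a SIMD path against a scalar one.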

Mar 07 '24 17:03 mapleFU

MacOS M1 Pro, compiler using LLVM-17

BM_ByteStreamSplitDecode_Float_Neon/1024            393 ns          393 ns      1781393 bytes_per_second=9.7103G/s
BM_ByteStreamSplitDecode_Float_Neon/4096           1523 ns         1522 ns       459550 bytes_per_second=10.0244G/s
BM_ByteStreamSplitDecode_Float_Neon/32768         13254 ns        13251 ns        52771 bytes_per_second=9.21235G/s
BM_ByteStreamSplitDecode_Float_Neon/65536         26862 ns        26856 ns        26041 bytes_per_second=9.09063G/s
BM_ByteStreamSplitDecode_Double_Neon/1024          1311 ns         1311 ns       534772 bytes_per_second=5.82162G/s
BM_ByteStreamSplitDecode_Double_Neon/4096          5166 ns         5165 ns       135459 bytes_per_second=5.90808G/s
BM_ByteStreamSplitDecode_Double_Neon/32768        46743 ns        46707 ns        14991 bytes_per_second=5.22712G/s
BM_ByteStreamSplitDecode_Double_Neon/65536        92789 ns        92769 ns         7546 bytes_per_second=5.26339G/s
BM_ByteStreamSplitEncode_Float_Neon/1024            565 ns          564 ns      1239926 bytes_per_second=6.7585G/s
BM_ByteStreamSplitEncode_Float_Neon/4096           2207 ns         2206 ns       317266 bytes_per_second=6.91565G/s
BM_ByteStreamSplitEncode_Float_Neon/32768         18854 ns        18847 ns        37160 bytes_per_second=6.47679G/s
BM_ByteStreamSplitEncode_Float_Neon/65536         37583 ns        37568 ns        18597 bytes_per_second=6.49871G/s
BM_ByteStreamSplitEncode_Double_Neon/1024           924 ns          924 ns       758602 bytes_per_second=8.25749G/s
BM_ByteStreamSplitEncode_Double_Neon/4096          3645 ns         3644 ns       192023 bytes_per_second=8.37507G/s
BM_ByteStreamSplitEncode_Double_Neon/32768        33733 ns        33721 ns        20762 bytes_per_second=7.23999G/s
BM_ByteStreamSplitEncode_Double_Neon/65536        69052 ns        69030 ns        10090 bytes_per_second=7.07349G/s

BM_ByteStreamSplitDecode_Float_Scalar/1024          782 ns          782 ns       895232 bytes_per_second=4.88038G/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         3115 ns         3113 ns       224499 bytes_per_second=4.90148G/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       24910 ns        24904 ns        28080 bytes_per_second=4.90165G/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       49568 ns        49555 ns        14065 bytes_per_second=4.92667G/s
BM_ByteStreamSplitDecode_Double_Scalar/1024        1567 ns         1567 ns       447342 bytes_per_second=4.86952G/s
BM_ByteStreamSplitDecode_Double_Scalar/4096        6242 ns         6239 ns       112187 bytes_per_second=4.89108G/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      51709 ns        51700 ns        13575 bytes_per_second=4.72229G/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     103764 ns       103734 ns         6749 bytes_per_second=4.70704G/s
BM_ByteStreamSplitEncode_Float_Scalar/1024         1052 ns         1052 ns       664483 bytes_per_second=3.62624G/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         4190 ns         4189 ns       167091 bytes_per_second=3.64299G/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       33639 ns        33628 ns        20806 bytes_per_second=3.63006G/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       67292 ns        67274 ns        10410 bytes_per_second=3.62906G/s
BM_ByteStreamSplitEncode_Double_Scalar/1024        2107 ns         2106 ns       334135 bytes_per_second=3.62257G/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        8359 ns         8357 ns        83645 bytes_per_second=3.65182G/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      67293 ns        67271 ns        10369 bytes_per_second=3.62921G/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     134826 ns       134783 ns         5199 bytes_per_second=3.62273G/s

Mar 07 '24 18:03 mapleFU

@mapleFU I have no idea. Perhaps @serge-sans-paille would like to advise here.

I don't have all the context, but upgrading xsimd looks reasonable if it fixes your issues. Would you need a new release?

Mar 07 '24 22:03 serge-sans-paille

MacOS M1 Pro, compiler using LLVM-17

Can you also post the _Scalar numbers for comparison?

Mar 07 '24 22:03 pitrou

Can you also post the _Scalar numbers for comparison?

Done

Mar 08 '24 02:03 mapleFU

My AMD 3800X Scalar code benchmark:

BM_ByteStreamSplitDecode_Float_Scalar/1024         1321 ns         1321 ns       655835 bytes_per_second=2.88745Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         4252 ns         4252 ns       163571 bytes_per_second=3.58879Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       40957 ns        40957 ns        16015 bytes_per_second=2.98046Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       92735 ns        92734 ns         8404 bytes_per_second=2.6327Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/1024        3991 ns         3991 ns       185298 bytes_per_second=1.9117Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       11446 ns        11446 ns        59298 bytes_per_second=2.66621Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      85616 ns        85615 ns         7896 bytes_per_second=2.8516Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     147690 ns       147689 ns         4269 bytes_per_second=3.30614Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/1024          889 ns          889 ns       856381 bytes_per_second=4.29155Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         3292 ns         3292 ns       203271 bytes_per_second=4.6351Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       26056 ns        26055 ns        27125 bytes_per_second=4.68507Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       52304 ns        52304 ns        13491 bytes_per_second=4.66769Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/1024        2216 ns         2216 ns       354656 bytes_per_second=3.44257Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        7198 ns         7198 ns        91613 bytes_per_second=4.24001Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      61507 ns        61507 ns        13072 bytes_per_second=3.96931Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     125982 ns       125981 ns         5013 bytes_per_second=3.87584Gi/s

Mar 08 '24 05:03 mapleFU

What about the macOS M1 Pro ?

Mar 08 '24 09:03 pitrou

What about the macOS M1 Pro ?

I've updated the results here: https://github.com/apache/arrow/pull/40335#issuecomment-1984131068

Basically, it's about 2 times faster.
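As a rough check of that figure, the snippet below divides the Neon throughput by the Scalar throughput for the /1024 rows of the M1 Pro results posted earlier in this thread. The struct and function names are made up for this illustration; only the numbers come from the thread.

```cpp
#include <cassert>

// Throughput in G/s, copied from the /1024 rows of the M1 Pro results
// posted earlier in this thread (Neon vs. Scalar).
struct ThroughputPair {
  const char* name;
  double neon;
  double scalar;
};

constexpr ThroughputPair kM1Pro1024[] = {
    {"Decode_Float", 9.7103, 4.88038},
    {"Decode_Double", 5.82162, 4.86952},
    {"Encode_Float", 6.7585, 3.62624},
    {"Encode_Double", 8.25749, 3.62257},
};

// Speedup of the Neon path over the scalar path for one benchmark.
constexpr double Speedup(const ThroughputPair& p) { return p.neon / p.scalar; }
```

On these rows the ratio is close to 2 for Decode_Float, Encode_Float and Encode_Double, while Decode_Double comes out nearer to 1.2.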

Mar 08 '24 14:03 mapleFU

Very nice, thank you! @cyb70289 Do you have the possibility to run on other ARM CPUs?

Mar 08 '24 14:03 pitrou

Very nice, thank you! @cyb70289 Do you have the possibility to run on other ARM CPUs?

sure, will do

Mar 08 '24 14:03 cyb70289

@github-actions crossbow submit -g cpp

Mar 08 '24 17:03 pitrou

Revision: bd415ef7364994532a9ec807e387e4b1f3aee7ff

Submitted crossbow builds: ursacomputing/crossbow @ actions-5e50b95523

Task	Status
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp
test-debian-11-cpp-amd64
test-debian-11-cpp-i386
test-fedora-39-cpp
test-ubuntu-20.04-cpp
test-ubuntu-20.04-cpp-bundled
test-ubuntu-20.04-cpp-minimal-with-formats
test-ubuntu-20.04-cpp-thread-sanitizer
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-gcc-14

Mar 08 '24 17:03 github-actions[bot]

@github-actions crossbow submit -g wheel

Mar 09 '24 15:03 pitrou

Revision: dea27751eb6df9c4cda76ddebefb32eecb762539

Submitted crossbow builds: ursacomputing/crossbow @ actions-4e1aef9966

Task	Status
wheel-macos-big-sur-cp310-arm64
wheel-macos-big-sur-cp311-arm64
wheel-macos-big-sur-cp312-arm64
wheel-macos-big-sur-cp38-arm64
wheel-macos-big-sur-cp39-arm64
wheel-macos-catalina-cp310-amd64
wheel-macos-catalina-cp311-amd64
wheel-macos-catalina-cp312-amd64
wheel-macos-catalina-cp38-amd64
wheel-macos-catalina-cp39-amd64
wheel-manylinux-2-28-cp310-amd64
wheel-manylinux-2-28-cp310-arm64
wheel-manylinux-2-28-cp311-amd64
wheel-manylinux-2-28-cp311-arm64
wheel-manylinux-2-28-cp312-amd64
wheel-manylinux-2-28-cp312-arm64
wheel-manylinux-2-28-cp38-amd64
wheel-manylinux-2-28-cp38-arm64
wheel-manylinux-2-28-cp39-amd64
wheel-manylinux-2-28-cp39-arm64
wheel-manylinux-2014-cp310-amd64
wheel-manylinux-2014-cp310-arm64
wheel-manylinux-2014-cp311-amd64
wheel-manylinux-2014-cp311-arm64
wheel-manylinux-2014-cp312-amd64
wheel-manylinux-2014-cp312-arm64
wheel-manylinux-2014-cp38-amd64
wheel-manylinux-2014-cp38-arm64
wheel-manylinux-2014-cp39-amd64
wheel-manylinux-2014-cp39-arm64
wheel-windows-cp310-amd64
wheel-windows-cp311-amd64
wheel-windows-cp312-amd64
wheel-windows-cp38-amd64
wheel-windows-cp39-amd64

Mar 09 '24 15:03 github-actions[bot]

Hmm, could you please rebase to get some CI fixes?

Mar 09 '24 15:03 pitrou

@github-actions crossbow submit -g wheel

Mar 09 '24 17:03 mapleFU

Revision: 4aa6fdd69eff9b8927d55e0908e3cec5c9c23cd4

Submitted crossbow builds: ursacomputing/crossbow @ actions-761d7cbf39

Task	Status
wheel-macos-big-sur-cp310-arm64
wheel-macos-big-sur-cp311-arm64
wheel-macos-big-sur-cp312-arm64
wheel-macos-big-sur-cp38-arm64
wheel-macos-big-sur-cp39-arm64
wheel-macos-catalina-cp310-amd64
wheel-macos-catalina-cp311-amd64
wheel-macos-catalina-cp312-amd64
wheel-macos-catalina-cp38-amd64
wheel-macos-catalina-cp39-amd64
wheel-manylinux-2-28-cp310-amd64
wheel-manylinux-2-28-cp310-arm64
wheel-manylinux-2-28-cp311-amd64
wheel-manylinux-2-28-cp311-arm64
wheel-manylinux-2-28-cp312-amd64
wheel-manylinux-2-28-cp312-arm64
wheel-manylinux-2-28-cp38-amd64
wheel-manylinux-2-28-cp38-arm64
wheel-manylinux-2-28-cp39-amd64
wheel-manylinux-2-28-cp39-arm64
wheel-manylinux-2014-cp310-amd64
wheel-manylinux-2014-cp310-arm64
wheel-manylinux-2014-cp311-amd64
wheel-manylinux-2014-cp311-arm64
wheel-manylinux-2014-cp312-amd64
wheel-manylinux-2014-cp312-arm64
wheel-manylinux-2014-cp38-amd64
wheel-manylinux-2014-cp38-arm64
wheel-manylinux-2014-cp39-amd64
wheel-manylinux-2014-cp39-arm64
wheel-windows-cp310-amd64
wheel-windows-cp311-amd64
wheel-windows-cp312-amd64
wheel-windows-cp38-amd64
wheel-windows-cp39-amd64

Mar 09 '24 17:03 github-actions[bot]

I don't have all the context, but upgrading xsimd looks reasonable if it fixes your issues. Would you need a new release?

Hi @serge-sans-paille. I found some neon64-related issues here. For the issue I hit in this patch (casting between types in a register), I'm now using memcpy as a workaround. Could we switch to a new release once one containing these bug fixes is available?
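To illustrate the general idea of a memcpy-based workaround, here is a minimal sketch that reinterprets the same 16 bytes as a different lane type. It is an assumption-laden stand-in: plain std::array values play the role of 128-bit registers, the name BitCastLanes is hypothetical, and the PR's actual code operates on xsimd batches, which are not used here.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Plain 16-byte arrays stand in for 128-bit SIMD registers.
// (Assumption for illustration; the PR works with xsimd batches.)
using Int8Lanes = std::array<int8_t, 16>;
using Int32Lanes = std::array<int32_t, 4>;

// Reinterpret the bits of `from` as another lane type of the same size.
// std::memcpy avoids the aliasing problems of a pointer cast, and
// compilers typically optimize it down to a register move.
template <typename To, typename From>
To BitCastLanes(const From& from) {
  static_assert(sizeof(To) == sizeof(From), "lane types must have equal size");
  To to;
  std::memcpy(&to, &from, sizeof(To));
  return to;
}
```

Casting to the other lane type and back leaves the bytes unchanged, so the helper only changes how the register contents are viewed.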

Thanks for xsimd. This is my first time doing SIMD programming, and it seems convenient with xsimd :-)

Mar 09 '24 17:03 mapleFU

On Sat, Mar 09, 2024 at 09:37:31AM -0800, mwish wrote:

I don't have all the context, but upgrading xsimd looks reasonable if it
fixes your issues. Would you need a new release?
Hi @serge-sans-paille. I found some neon64-related issues here. For the issue I hit in this patch (casting between types in a register), I'm now using memcpy as a workaround. Could we switch to a new release once one containing these bug fixes is available?

Could you open a separate bug in the xsimd bug tracker with a reproducer?

Mar 09 '24 21:03 serge-sans-paille

Could you open a separate bug in the xsimd bug tracker with a reproducer?

I'm not talking about a bug. I mean https://github.com/apache/arrow/pull/40335#issuecomment-1983644942 : some bug fixes and neon64-related enhancements are not included in the latest release, 12.1.1.

Would you need a new release?

Maybe a new release would be better

Mar 09 '24 21:03 mapleFU

Tested on Neoverse-N1. With clang, I see a performance improvement in both the encoder and the decoder. With gcc, however, the encoder gets somewhat slower.

- clang-16, improvement from both encoder and decoder

decode (improve)
----------------
BM_ByteStreamSplitDecode_Float_Scalar/1024         1167 ns         1167 ns       600395 bytes_per_second=3.27015Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         4648 ns         4648 ns       150615 bytes_per_second=3.28313Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       38248 ns        38247 ns        18300 bytes_per_second=3.19159Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       76448 ns        76446 ns         9159 bytes_per_second=3.19363Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/1024        2814 ns         2814 ns       248735 bytes_per_second=2.71086Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       11236 ns        11236 ns        62307 bytes_per_second=2.7161Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      92623 ns        92616 ns         7551 bytes_per_second=2.63604Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     188190 ns       188185 ns         3728 bytes_per_second=2.59469Gi/s

BM_ByteStreamSplitDecode_Float_Neon/1024            817 ns          817 ns       856316 bytes_per_second=4.66674Gi/s
BM_ByteStreamSplitDecode_Float_Neon/4096           3240 ns         3240 ns       216075 bytes_per_second=4.71005Gi/s
BM_ByteStreamSplitDecode_Float_Neon/32768         26981 ns        26981 ns        25942 bytes_per_second=4.52429Gi/s
BM_ByteStreamSplitDecode_Float_Neon/65536         54189 ns        54186 ns        12924 bytes_per_second=4.50564Gi/s
BM_ByteStreamSplitDecode_Double_Neon/1024          1767 ns         1767 ns       396110 bytes_per_second=4.31715Gi/s
BM_ByteStreamSplitDecode_Double_Neon/4096          7138 ns         7137 ns        98106 bytes_per_second=4.27568Gi/s
BM_ByteStreamSplitDecode_Double_Neon/32768        64999 ns        64997 ns        10779 bytes_per_second=3.75616Gi/s
BM_ByteStreamSplitDecode_Double_Neon/65536       130243 ns       130243 ns         5366 bytes_per_second=3.74901Gi/s

encode (improve)
----------------
BM_ByteStreamSplitEncode_Float_Scalar/1024         1482 ns         1482 ns       472507 bytes_per_second=2.57419Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         5897 ns         5897 ns       118700 bytes_per_second=2.58776Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       47959 ns        47956 ns        14597 bytes_per_second=2.54548Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       95903 ns        95896 ns         7298 bytes_per_second=2.54588Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/1024        2950 ns         2950 ns       237274 bytes_per_second=2.58627Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/4096       11786 ns        11786 ns        59393 bytes_per_second=2.58938Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      98141 ns        98138 ns         7133 bytes_per_second=2.48773Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     198219 ns       198203 ns         3531 bytes_per_second=2.46354Gi/s

BM_ByteStreamSplitEncode_Float_Neon/1024           1152 ns         1152 ns       607844 bytes_per_second=3.31275Gi/s
BM_ByteStreamSplitEncode_Float_Neon/4096           4571 ns         4570 ns       153146 bytes_per_second=3.33858Gi/s
BM_ByteStreamSplitEncode_Float_Neon/32768         37086 ns        37084 ns        18873 bytes_per_second=3.29172Gi/s
BM_ByteStreamSplitEncode_Float_Neon/65536         74336 ns        74336 ns         9417 bytes_per_second=3.2843Gi/s
BM_ByteStreamSplitEncode_Double_Neon/1024          1978 ns         1978 ns       353156 bytes_per_second=3.85706Gi/s
BM_ByteStreamSplitEncode_Double_Neon/4096          7947 ns         7947 ns        87879 bytes_per_second=3.84032Gi/s
BM_ByteStreamSplitEncode_Double_Neon/32768        64458 ns        64458 ns        10863 bytes_per_second=3.7876Gi/s
BM_ByteStreamSplitEncode_Double_Neon/65536       128693 ns       128689 ns         5440 bytes_per_second=3.79428Gi/s

- gcc-13, decoder improves, but encoder drops

decode (improve)
----------------
BM_ByteStreamSplitDecode_Float_Scalar/1024         1133 ns         1133 ns       617695 bytes_per_second=3.3663Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/4096         4484 ns         4484 ns       156105 bytes_per_second=3.40284Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/32768       36318 ns        36318 ns        19273 bytes_per_second=3.36116Gi/s
BM_ByteStreamSplitDecode_Float_Scalar/65536       73048 ns        73047 ns         9554 bytes_per_second=3.34225Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/1024        2814 ns         2814 ns       248738 bytes_per_second=2.7114Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/4096       11227 ns        11226 ns        62355 bytes_per_second=2.71838Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/32768      92482 ns        92478 ns         7552 bytes_per_second=2.64Gi/s
BM_ByteStreamSplitDecode_Double_Scalar/65536     185853 ns       185844 ns         3748 bytes_per_second=2.62737Gi/s

BM_ByteStreamSplitDecode_Float_Neon/1024            775 ns          775 ns       903307 bytes_per_second=4.92282Gi/s
BM_ByteStreamSplitDecode_Float_Neon/4096           3061 ns         3061 ns       228720 bytes_per_second=4.98565Gi/s
BM_ByteStreamSplitDecode_Float_Neon/32768         25543 ns        25542 ns        27405 bytes_per_second=4.77925Gi/s
BM_ByteStreamSplitDecode_Float_Neon/65536         51478 ns        51474 ns        13609 bytes_per_second=4.74294Gi/s
BM_ByteStreamSplitDecode_Double_Neon/1024          1626 ns         1626 ns       429095 bytes_per_second=4.69278Gi/s
BM_ByteStreamSplitDecode_Double_Neon/4096          6485 ns         6485 ns       107513 bytes_per_second=4.70567Gi/s
BM_ByteStreamSplitDecode_Double_Neon/32768        59680 ns        59680 ns        11757 bytes_per_second=4.09083Gi/s
BM_ByteStreamSplitDecode_Double_Neon/65536       120697 ns       120688 ns         5594 bytes_per_second=4.04582Gi/s

encode (drop)
-------------
BM_ByteStreamSplitEncode_Float_Scalar/1024         1142 ns         1142 ns       613228 bytes_per_second=3.34041Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/4096         4511 ns         4511 ns       155178 bytes_per_second=3.3825Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/32768       37560 ns        37560 ns        18636 bytes_per_second=3.25003Gi/s
BM_ByteStreamSplitEncode_Float_Scalar/65536       75348 ns        75343 ns         9301 bytes_per_second=3.2404Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/1024        2201 ns         2201 ns       318028 bytes_per_second=3.46606Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/4096        8795 ns         8795 ns        79615 bytes_per_second=3.46994Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/32768      77388 ns        77383 ns         9045 bytes_per_second=3.15497Gi/s
BM_ByteStreamSplitEncode_Double_Scalar/65536     153900 ns       153900 ns         4543 bytes_per_second=3.17272Gi/s

BM_ByteStreamSplitEncode_Float_Neon/1024           1238 ns         1238 ns       565551 bytes_per_second=3.08201Gi/s
BM_ByteStreamSplitEncode_Float_Neon/4096           4894 ns         4893 ns       143073 bytes_per_second=3.11821Gi/s
BM_ByteStreamSplitEncode_Float_Neon/32768         39594 ns        39594 ns        17679 bytes_per_second=3.08304Gi/s
BM_ByteStreamSplitEncode_Float_Neon/65536         79201 ns        79200 ns         8838 bytes_per_second=3.0826Gi/s
BM_ByteStreamSplitEncode_Double_Neon/1024          2573 ns         2573 ns       272609 bytes_per_second=2.96532Gi/s
BM_ByteStreamSplitEncode_Double_Neon/4096         10249 ns        10248 ns        68149 bytes_per_second=2.97782Gi/s
BM_ByteStreamSplitEncode_Double_Neon/32768        88791 ns        88791 ns         7884 bytes_per_second=2.7496Gi/s
BM_ByteStreamSplitEncode_Double_Neon/65536       176888 ns       176888 ns         3958 bytes_per_second=2.7604Gi/s

Mar 11 '24 02:03 cyb70289

I did some quick profiling. With gcc, it looks like the Neon code doesn't reduce the total number of instructions: normalizing instructions by Iterations gives a similar result for both the scalar and the neon benchmark. ASE_SPEC in the output is the total number of Neon (asimd) instructions.

profile scalar encode

perf stat -e ASE_SPEC,instructions,cycles -- release/parquet-encoding-benchmark --benchmark_filter=BM_ByteStreamSplitEncode_Double_Scalar/65536

-------------------------------------------------------------------------------------------------------
Benchmark                                             Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------
BM_ByteStreamSplitEncode_Double_Scalar/65536     152798 ns       152798 ns         4586 bytes_per_second=3.19561Gi/s

            37,622      ASE_SPEC
     9,189,030,863      instructions                     #    3.48  insn per cycle
     2,643,007,625      cycles

profile neon encode

perf stat -e ASE_SPEC,instructions,cycles -- release/parquet-encoding-benchmark --benchmark_filter=BM_ByteStreamSplitEncode_Double_Neon/65536

-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
BM_ByteStreamSplitEncode_Double_Neon/65536     177150 ns       177149 ns         3948 bytes_per_second=2.75633Gi/s

     2,985,430,024      ASE_SPEC
     7,998,814,771      instructions                     #    2.94  insn per cycle
     2,718,202,634      cycles
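The normalization mentioned above can be reproduced with the counters from the two runs. This is only a rough sketch: the names are hypothetical, and perf stat counts the whole process (setup and any calibration passes of the benchmark included), so the quotient approximates rather than measures the per-iteration cost.

```cpp
#include <cassert>
#include <cstdint>

// Counter values copied from the two perf stat runs above.
struct PerfRun {
  std::uint64_t instructions;  // "instructions" line of perf stat
  std::uint64_t iterations;    // "Iterations" column of the benchmark
};

constexpr PerfRun kScalarRun{9189030863ULL, 4586};
constexpr PerfRun kNeonRun{7998814771ULL, 3948};

// Whole-process instruction count divided by the reported iteration count.
constexpr double InstructionsPerIteration(const PerfRun& run) {
  return static_cast<double>(run.instructions) /
         static_cast<double>(run.iterations);
}
```

With these inputs both runs come out at roughly two million instructions per iteration, about 1% apart, which matches the observation that the Neon build does not reduce the instruction count under gcc.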

Mar 11 '24 02:03 cyb70289
