arrow GH-48277: [C++][Parquet] unpack with shuffle algorithm

Rationale for this change

What changes are included in this PR?

Add a new method for building unpacking kernels. The constexpr code generation creates a kernel appropriate for a given input/output bit width and SIMD size.
I have included a number of xsimd fallbacks that have since been merged upstream.
I have run extensive benchmarks and, on specific architectures, re-dispatched among different sizes where a kernel was not performing well.
The biggest win here is SSE4.2, though AVX2 improves too.
This is not built or tested for AVX512, though there is no real limitation preventing it. Currently the arch detection among the AVX512 variants is inconsistent and sometimes errors out. I would need to investigate this with the upcoming xsimd release.
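
For readers unfamiliar with the approach, here is a minimal scalar sketch of the idea behind a constexpr-generated shuffle kernel. It is not the code in this PR: the names `UnpackPlan`, `MakePlan`, and `UnpackScalarModel` are hypothetical, and the loop only models what a SIMD byte shuffle followed by a per-lane shift and mask would do. It assumes LSB-first bit packing (as in Parquet), bit widths of at most 25 so that each value fits in a 4-byte window, and an input buffer padded so that the last 4-byte window is readable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical compile-time plan: for each output lane, which input byte its
// value starts in and how many bits to shift right once that window is loaded.
template <int kBits, int kLanes>
struct UnpackPlan {
  int byte_offset[kLanes];
  int bit_shift[kLanes];
};

// Builds the plan at compile time from the bit width and lane count alone.
template <int kBits, int kLanes>
constexpr UnpackPlan<kBits, kLanes> MakePlan() {
  UnpackPlan<kBits, kLanes> plan{};
  for (int i = 0; i < kLanes; ++i) {
    const int first_bit = i * kBits;     // position of value i in the bit stream
    plan.byte_offset[i] = first_bit / 8;
    plan.bit_shift[i] = first_bit % 8;
  }
  return plan;
}

// Scalar model of the kernel: gather a 4-byte window per lane (the "shuffle"),
// shift the value down to bit 0, then mask off the neighbouring values.
// The caller must make byte_offset[kLanes - 1] + 3 a readable index.
template <int kBits, int kLanes>
void UnpackScalarModel(const uint8_t* in, uint32_t* out) {
  static_assert(kBits >= 1 && kBits <= 25, "value plus shift must fit in 32 bits");
  constexpr UnpackPlan<kBits, kLanes> kPlan = MakePlan<kBits, kLanes>();
  constexpr uint32_t kMask = (uint32_t{1} << kBits) - 1;
  for (int i = 0; i < kLanes; ++i) {
    const uint8_t* p = in + kPlan.byte_offset[i];
    // Assemble the window byte by byte so the result is endian-independent.
    const uint32_t window = uint32_t{p[0]} | (uint32_t{p[1]} << 8) |
                            (uint32_t{p[2]} << 16) | (uint32_t{p[3]} << 24);
    out[i] = (window >> kPlan.bit_shift[i]) & kMask;
  }
}
```

Calling `UnpackScalarModel<3, 8>` on the bytes `0x88, 0xC6, 0xFA` followed by padding yields the values 0 through 7. In a real SIMD kernel the per-lane offsets would become the constant operand of a byte shuffle and the shifts a constant vector, which is why computing them at compile time is attractive.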

Are these changes tested?

Yes

Are there any user-facing changes?

No

GitHub Issue: #48277

Oct 29 '25 14:10 AntoinePrv

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

Oct 29 '25 14:10 github-actions[bot]

:warning: GitHub issue #48277 has been automatically assigned in GitHub to PR creator.

Nov 27 '25 13:11 github-actions[bot]

@pitrou apart from R-lint, this is looking pretty good.

Nov 27 '25 18:11 AntoinePrv

@ursabot please benchmark lang=C++

Nov 27 '25 18:11 pitrou

Benchmark runs are scheduled for commit a4bfe8addf409c235e0fd96eead5b489447029d0. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Nov 27 '25 18:11 voltrondatabot

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit a4bfe8addf409c235e0fd96eead5b489447029d0.

There were 37 benchmark results indicating a performance regression:

Pull Request Run on amd64-c6a-4xlarge-linux at 2025-11-27 19:32:37Z
- BM_UnpackUint64 (C++) with params=DynamicAligned/47/64, source=cpp-micro, suite=arrow-bpacking-benchmark
- IsInInt64SmallSet (C++) with params=64, source=cpp-micro, suite=arrow-compute-scalar-set-lookup-benchmark
and 35 more (see the report linked below)

The full Conbench report has more details.

Nov 27 '25 20:11 conbench-apache-arrow[bot]

@pitrou I'm running this locally, and I made an error when fixing the ASAN over-read problem. These latest benchmarks are not doing well.

Nov 28 '25 09:11 AntoinePrv

@ursabot please benchmark lang=C++

Nov 28 '25 14:11 pitrou

Benchmark runs are scheduled for commit dd3ec0d692e4f409bd73952de9bab20d8c97c226. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Nov 28 '25 14:11 voltrondatabot

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit dd3ec0d692e4f409bd73952de9bab20d8c97c226.

There were 19 benchmark results indicating a performance regression:

Pull Request Run on amd64-c6a-4xlarge-linux at 2025-11-28 15:12:47Z
- BM_UnpackUint32 (C++) with params=DynamicUnaligned/20/64, source=cpp-micro, suite=arrow-bpacking-benchmark
- BM_DeltaLengthDecodingByteArray (C++) with params=max-string-length:8/batch-size:2048, source=cpp-micro, suite=parquet-encoding-benchmark
and 17 more (see the report linked below)

The full Conbench report has more details.

Nov 28 '25 18:11 conbench-apache-arrow[bot]

@ursabot please benchmark lang=C++

Dec 01 '25 15:12 pitrou

Benchmark runs are scheduled for commit 408ef04ad96a9752654cfe54d4de6c7c2eef08cc. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Dec 01 '25 15:12 voltrondatabot

@ursabot please benchmark lang=C++

Dec 02 '25 15:12 pitrou

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 408ef04ad96a9752654cfe54d4de6c7c2eef08cc.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

Dec 10 '25 19:12 conbench-apache-arrow[bot]

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 408ef04ad96a9752654cfe54d4de6c7c2eef08cc.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

Dec 13 '25 15:12 conbench-apache-arrow[bot]

@ursabot please benchmark lang=C++

Dec 17 '25 14:12 pitrou