arrow [Gh-539][ParquetEncoding][c++] Add ALPpd encoding to parquet

Rationale for this change

ALP significantly improves the compression ratio and decompression speed of float/double columns compared with other encoding and compression techniques.

What changes are included in this PR?

This PR introduces ALP (pseudo-decimal) encoding into the Arrow C++ code. We also provide benchmarks and a dataset to demonstrate the effectiveness of the algorithm.

This required adding the following classes:

Alp h/cc : Houses core logic for encoding and decoding.
Sampler h/cc : Houses logic to sample and select parameters for encoding.
AlpWrapper h/cc : Binds together Alp and Sampler classes.
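
For readers unfamiliar with the pseudo-decimal idea, here is a minimal sketch of how the encode/decode core and the parameter sampler could fit together. Every name below (alp_sketch, EncodeValue, DecodeValue, IsException, ChooseExponent, kMaxExponent) is hypothetical and is not the API added in this PR. The sketch also simplifies the scheme: it uses a single decimal exponent, decodes by division, and skips bit-packing and exception storage, so treat it as an illustration under those assumptions rather than a description of the actual implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; none of these are the PR's actual API.
namespace alp_sketch {

// Assumption: candidate decimal exponents 0..17 are enough for doubles.
constexpr int kMaxExponent = 17;

// Returns 10^e as a double (exactly representable for e <= 22).
inline double Pow10(int e) {
  double p = 1.0;
  for (int i = 0; i < e; ++i) p *= 10.0;
  return p;
}

// Encode: scale by 10^e and round to the nearest integer.
// Callers must ensure the scaled value fits in int64 (see IsException).
inline int64_t EncodeValue(double v, int e) {
  return std::llround(v * Pow10(e));
}

// Decode: undo the scaling (division is a simplification in this sketch).
inline double DecodeValue(int64_t encoded, int e) {
  return static_cast<double>(encoded) / Pow10(e);
}

// A value is an "exception" if it does not survive the round trip exactly
// and would therefore have to be stored separately.
inline bool IsException(double v, int e) {
  const double scaled = v * Pow10(e);
  // Reject NaN/Inf and magnitudes beyond exactly representable integers.
  if (!std::isfinite(scaled) || std::fabs(scaled) > 9.0e15) return true;
  const double decoded = DecodeValue(EncodeValue(v, e), e);
  // Also treat a lost sign (for example -0.0) as an exception.
  return decoded != v || std::signbit(decoded) != std::signbit(v);
}

// Sampler: test every `stride`-th value and pick the exponent that yields
// the fewest exceptions. Ties go to the smallest exponent.
inline int ChooseExponent(const std::vector<double>& values,
                          std::size_t stride = 4) {
  assert(stride > 0);
  int best_e = 0;
  std::size_t best_exceptions = values.size() + 1;
  for (int e = 0; e <= kMaxExponent; ++e) {
    std::size_t exceptions = 0;
    for (std::size_t i = 0; i < values.size(); i += stride) {
      if (IsException(values[i], e)) ++exceptions;
    }
    if (exceptions < best_exceptions) {
      best_exceptions = exceptions;
      best_e = e;
    }
  }
  return best_e;
}

}  // namespace alp_sketch
```

In this sketch, a column such as {1.25, 3.5, 10.75, 0.05} sampled with stride 1 selects exponent 2, so each value becomes a small integer (125, 350, 1075, 5) that is far cheaper to bit-pack than a raw 64-bit double. A wrapper would then call ChooseExponent once per batch and EncodeValue per value.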

The above code is integrated in

Encoder/Decoder cc, which exposes the wrapper for encoding a buffer of data.

Unit tests were added to

alp_test.cc

Benchmarks were added to

encoding_benchmark.cc and encoding_alp_benchmark.cc

Are these changes tested?

We have added unit tests for the code.
Benchmarks have also been added that cover a wide variety of floating-point values, from low precision to high precision.

Are there any user-facing changes?

It's a new encoding, so the only user-facing impact is on query performance, which we claim will only improve.

Dec 05 '25 00:12 prtkgaur

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

Dec 05 '25 00:12 github-actions[bot]

Thanks @prtkgaur -- it is super exciting to see this movement.

Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review.

I started the CI checks on this PR and had some comments about the testing.

Dec 08 '25 14:12 alamb

Talked offline and wanted to capture notes on high-level changes:

For headers, let's try to reduce duplication with values already in the Parquet header.
For remaining items in headers, let's try to be parsimonious with values (e.g. 4 bytes is probably overkill for enums).
Naming convention on files is off (use snake_case).
Given the description of ALP, we probably want a top-level encoding enum value for the 2 different modes of ALP.

Dec 09 '25 21:12 emkornfield

Thanks for the feedback @emkornfield. We have addressed these points:

Duplication of fields between the page header and the ALP header has been reduced.
Other fields have been updated to use 1 byte. The header is now just 8 bytes, compared to 40 bytes earlier.
Naming of files has been updated.
We do have top-level enums describing the mode and layout structure: enum class AlpBitPackLayout { kNormal }; and enum class AlpMode { kAlp };
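
To illustrate how 1-byte enums keep a header at 8 bytes, here is a rough sketch. Only the enum names AlpMode and AlpBitPackLayout and their single enumerators come from this thread. The uint8_t underlying type, the AlpHeader struct, its field names and order, and the WriteHeader/ReadHeader helpers are assumptions made for illustration and may not match the actual layout in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical layout; field names and order are assumptions, not the PR's.
namespace alp_header_sketch {

// Giving each enum a 1-byte underlying type avoids the default 4-byte int.
enum class AlpMode : uint8_t { kAlp = 0 };
enum class AlpBitPackLayout : uint8_t { kNormal = 0 };

struct AlpHeader {
  uint8_t version;          // assumed field
  AlpMode mode;             // 1 byte
  AlpBitPackLayout layout;  // 1 byte
  uint8_t reserved;         // assumed padding byte
  uint32_t vector_size;     // assumed field, naturally aligned at offset 4
};

static_assert(sizeof(AlpMode) == 1, "enum should occupy a single byte");
static_assert(sizeof(AlpBitPackLayout) == 1, "enum should occupy a single byte");
static_assert(sizeof(AlpHeader) == 8, "header should stay at 8 bytes");

// Copies the header into a byte buffer of at least 8 bytes.
// Uses host byte order; a real format would pin the endianness.
inline void WriteHeader(const AlpHeader& header, uint8_t* out) {
  assert(out != nullptr);
  std::memcpy(out, &header, sizeof(AlpHeader));
}

// Reads a header back from a byte buffer of at least 8 bytes.
inline AlpHeader ReadHeader(const uint8_t* in) {
  assert(in != nullptr);
  AlpHeader header;
  std::memcpy(&header, in, sizeof(AlpHeader));
  return header;
}

}  // namespace alp_header_sketch
```

The static_assert lines are the useful part of this sketch: they make any accidental growth of the header a compile-time error.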

Dec 16 '25 01:12 prtkgaur