[Gh-539][ParquetEncoding][c++] Add ALPpd encoding to parquet

Open prtkgaur opened this issue 3 weeks ago • 3 comments

Co-authored-by: [email protected]

Rationale for this change

ALP significantly improves both the compression ratio and the decompression speed of float/double columns compared with other encoding/compression techniques.

What changes are included in this PR?

This PR introduces ALP (pseudo-decimal) encoding into the Arrow C++ code. We also provide benchmarks and a dataset to demonstrate the effectiveness of the algorithm.

Adding it required the following new classes.

  • Alp h/cc : Houses core logic for encoding and decoding.
  • Sampler h/cc : Houses logic to sample and select parameters for encoding.
  • AlpWrapper h/cc : Binds together Alp and Sampler classes.
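
As a rough illustration of how the three pieces listed above could fit together, here is a minimal sketch of the pseudo-decimal idea. It is not the code in this PR. Every name in it (`alp_sketch`, `Encoded`, `TryEncodeOne`, `Encode`, `Decode`, `ChooseExponent`) is hypothetical, and it is simplified on purpose: it uses a single decimal exponent, decodes with a plain division, and skips the integer bit-packing stage entirely. A value is scaled to an integer, and if decoding does not reproduce it bit for bit, it is stored verbatim as an exception.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

namespace alp_sketch {  // hypothetical namespace, not part of Arrow

// Powers of ten that are exactly representable as doubles; index = exponent.
constexpr double kPow10[] = {1e0, 1e1, 1e2, 1e3, 1e4, 1e5,
                             1e6, 1e7, 1e8, 1e9, 1e10};
constexpr int kMaxExponent = 10;

struct Encoded {
  int exponent = 0;
  std::vector<int64_t> digits;          // one scaled integer per input value
  std::vector<size_t> exception_index;  // positions that did not round-trip
  std::vector<double> exception_value;  // original values at those positions
};

// Bitwise comparison, so NaN and -0.0 count as "did not round-trip".
inline bool SameBits(double a, double b) {
  return std::memcmp(&a, &b, sizeof(double)) == 0;
}

inline double DecodeOne(int64_t digits, int exponent) {
  return static_cast<double>(digits) / kPow10[exponent];
}

// Core step: returns true and sets *digits only if value survives
// encode -> decode unchanged.
inline bool TryEncodeOne(double value, int exponent, int64_t* digits) {
  const double scaled = value * kPow10[exponent];
  // Rejects NaN, infinities, and magnitudes beyond exact integer range.
  if (!(std::fabs(scaled) < 9.0e15)) return false;
  const int64_t candidate = std::llround(scaled);
  if (!SameBits(DecodeOne(candidate, exponent), value)) return false;
  *digits = candidate;
  return true;
}

inline Encoded Encode(const std::vector<double>& values, int exponent) {
  assert(exponent >= 0 && exponent <= kMaxExponent);
  Encoded out;
  out.exponent = exponent;
  out.digits.reserve(values.size());
  for (size_t i = 0; i < values.size(); ++i) {
    int64_t digits = 0;  // stays 0 as a placeholder for exceptions
    if (!TryEncodeOne(values[i], exponent, &digits)) {
      out.exception_index.push_back(i);
      out.exception_value.push_back(values[i]);
    }
    out.digits.push_back(digits);
  }
  return out;
}

inline std::vector<double> Decode(const Encoded& in) {
  std::vector<double> out(in.digits.size());
  for (size_t i = 0; i < out.size(); ++i) {
    out[i] = DecodeOne(in.digits[i], in.exponent);
  }
  // Patch the exceptions back in with their original values.
  for (size_t k = 0; k < in.exception_index.size(); ++k) {
    out[in.exception_index[k]] = in.exception_value[k];
  }
  return out;
}

// Stand-in for a sampler: the exponent with the fewest exceptions wins,
// with ties going to the smallest exponent.
inline int ChooseExponent(const std::vector<double>& sample) {
  int best = 0;
  size_t best_exceptions = sample.size() + 1;
  for (int e = 0; e <= kMaxExponent; ++e) {
    size_t exceptions = 0;
    int64_t unused = 0;
    for (double v : sample) {
      if (!TryEncodeOne(v, e, &unused)) ++exceptions;
    }
    if (exceptions < best_exceptions) {
      best_exceptions = exceptions;
      best = e;
    }
  }
  return best;
}

}  // namespace alp_sketch
```

In this sketch a wrapper would call `ChooseExponent` on a sample, pass the result to `Encode`, and later call `Decode`. The resulting integers are small and close together for decimal-like data, which is what makes a subsequent bit-packing step worthwhile.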

The code above is integrated in

  • Encoder/Decoder cc, which expose a wrapper for encoding a buffer of data.

Unit tests were added to

  • alp_test.cc

Benchmarks were added to

  • encoding_benchmark.cc and encoding_alp_benchmark.cc

Are these changes tested?

  • We have added unit tests that cover the new code.
  • We have also added benchmarks that cover a wide variety of floating-point values, from low precision to high precision.

Are there any user-facing changes?

  • It's a new encoding, so the only impact is on query performance, which we claim will only get better.

prtkgaur avatar Dec 05 '25 00:12 prtkgaur

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

github-actions[bot] avatar Dec 05 '25 00:12 github-actions[bot]

Thanks @prtkgaur -- it is super exciting to see this movement.

Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review.

I started the CI checks on this PR and had some comments about the testing.

alamb avatar Dec 08 '25 14:12 alamb

Talked offline and wanted to capture notes on high-level changes:

  1. For headers, let's try to reduce duplication with values already in the parquet header.
  2. For remaining items in headers, let's try to be parsimonious with values (e.g. 4 bytes is probably overkill for enums).
  3. Naming convention on files is off (use snake_case).
  4. Given the description of ALP, we probably want a top-level encoding enum value for the 2 different modes of ALP.

emkornfield avatar Dec 09 '25 21:12 emkornfield

Thanks for the feedback @emkornfield. We have addressed the points as follows:

  1. Reduced duplication of fields between the page header and the ALP header.
  2. Other fields have been updated to use 1 byte. The header is now just 8 bytes, compared to 40 bytes earlier.
  3. Naming of files has been updated.
  4. We do have top-level enums describing the mode and layout structure: enum class AlpBitPackLayout { kNormal }; and enum class AlpMode { kAlp };
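
To make the "1 byte per enum" point concrete, here is a small sketch of how such enums can be pinned to a single byte. The enum names come from the comment above. The `uint8_t` underlying type, the numeric values, and the `ToByte` / `ModeFromByte` helpers are assumptions made for illustration and may not match the PR.

```cpp
#include <cassert>
#include <cstdint>

// Names from the PR discussion; underlying type and values are assumed.
enum class AlpMode : uint8_t { kAlp = 0 };
enum class AlpBitPackLayout : uint8_t { kNormal = 0 };

// With an explicit uint8_t underlying type each enum occupies one byte,
// whereas an enum class without one defaults to int.
static_assert(sizeof(AlpMode) == 1, "AlpMode should fit in one byte");
static_assert(sizeof(AlpBitPackLayout) == 1,
              "AlpBitPackLayout should fit in one byte");

// Hypothetical helper: the byte that would be written into a header.
inline uint8_t ToByte(AlpMode mode) { return static_cast<uint8_t>(mode); }

// Hypothetical helper: validates a byte read from a header, so that
// unknown modes are rejected instead of being cast blindly.
inline bool ModeFromByte(uint8_t byte, AlpMode* out) {
  assert(out != nullptr);
  switch (byte) {
    case 0:
      *out = AlpMode::kAlp;
      return true;
    default:
      return false;
  }
}
```

Validating on read, as `ModeFromByte` does here, leaves room to add further modes later without older readers misinterpreting them.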

prtkgaur avatar Dec 16 '25 01:12 prtkgaur