[Gh-539][ParquetEncoding][c++] Add ALPpd encoding to parquet
Co-authored-by: [email protected]
Rationale for this change
ALP significantly improves the compression ratio and decompression speed of float/double columns compared with other encoding/compression techniques.
What changes are included in this PR?
This PR introduces ALP (pseudo-decimal) encoding into the Arrow C++ code. We also provide benchmarks and a dataset to demonstrate the effectiveness of the algorithm.
This required adding the following classes:
- Alp h/cc : Houses core logic for encoding and decoding.
- Sampler h/cc : Houses logic to sample and select parameters for encoding.
- AlpWrapper h/cc : Binds together Alp and Sampler classes.
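To make the split between the encoding core and the sampler concrete, here is a minimal sketch of the pseudo-decimal idea. It is not the code in this PR: every name (AlpParams, AlpTryEncode, AlpDecode, ChooseParams) is hypothetical, and the sampler is simplified to "pick the pair with the fewest exceptions" rather than estimating the compressed size. A double is scaled by 10^e / 10^f and rounded to an integer; if decoding does not reproduce the original value exactly, the value is treated as an exception and would be stored separately.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// All names below are hypothetical; they are not the classes added in this PR.
struct AlpParams {
  int exponent;  // e: values are scaled up by 10^e
  int factor;    // f: then scaled down by 10^f (f <= e)
};

inline double Pow10(int n) {
  assert(n >= 0 && n <= 18);  // 10^0..10^18 are exactly representable as double
  double result = 1.0;
  for (int i = 0; i < n; ++i) result *= 10.0;
  return result;
}

// Reverses the scaling applied by AlpTryEncode.
inline double AlpDecode(int64_t digits, AlpParams p) {
  return static_cast<double>(digits) * Pow10(p.factor) / Pow10(p.exponent);
}

// Returns true and writes the integer form if `value` round-trips exactly.
// Returns false if the value must be stored as an exception.
inline bool AlpTryEncode(double value, AlpParams p, int64_t* digits) {
  const double scaled = value * Pow10(p.exponent) / Pow10(p.factor);
  // Reject NaN/inf and magnitudes that a double cannot hold as an exact integer.
  if (!std::isfinite(scaled) || std::fabs(scaled) > 9.0e15) return false;
  const int64_t candidate = std::llround(scaled);
  const double decoded = AlpDecode(candidate, p);
  // Require an exact round trip, including the sign of zero.
  if (decoded != value || std::signbit(decoded) != std::signbit(value)) return false;
  *digits = candidate;
  return true;
}

// Simplified sampler: try every (e, f) pair on a sample and keep the first
// pair that yields the fewest exceptions.
inline AlpParams ChooseParams(const std::vector<double>& sample) {
  AlpParams best{0, 0};
  std::size_t best_exceptions = sample.size() + 1;
  for (int e = 0; e <= 18; ++e) {
    for (int f = 0; f <= e; ++f) {
      std::size_t exceptions = 0;
      int64_t unused = 0;
      for (double v : sample) {
        if (!AlpTryEncode(v, AlpParams{e, f}, &unused)) ++exceptions;
      }
      if (exceptions < best_exceptions) {
        best = AlpParams{e, f};
        best_exceptions = exceptions;
      }
    }
  }
  return best;
}
```

In this sketch the integers produced by AlpTryEncode would then be bit-packed, and a wrapper would run ChooseParams on a sample before encoding the full buffer.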
The code above is integrated in
- Encoder/Decoder cc, which exposes the wrapper to encode a buffer of data.
Unit tests were added to
- alp_test.cc
Benchmarks were added to
- encoding_benchmark.cc and encoding_alp_benchmark.cc
Are these changes tested?
- We have added unit tests for the new code.
- Benchmarks have also been added that cover a wide variety of floating-point values, from low precision to high precision.
Are there any user-facing changes?
- This is a new encoding, so the only user-facing impact is on query performance, which we expect to improve.
Thanks @prtkgaur -- it is super exciting to see this movement.
Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review.
I started the CI checks on this PR and had some comments about the testing.
Talked offline and wanted to capture notes on high-level changes:
- For headers, let's try to reduce duplication with values already in the parquet header.
- For remaining items in headers, let's try to be parsimonious with values (e.g. 4 bytes is probably overkill for enums).
- Naming convention on files is off (use snake_case).
- Given the description of ALP, we probably want a top-level encoding enum value for the 2 different modes of ALP.
Thanks for the feedback @emkornfield. We have addressed the following:
- Reduced duplication of fields between the page header and the ALP header.
- Other fields have been updated to use 1 byte. The header is now just 8 bytes, compared to 40 bytes earlier.
- File names have been updated.
- We do have top-level enums describing the mode and layout structure: enum class AlpBitPackLayout { kNormal }; and enum class AlpMode { kAlp };
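For readers following the header discussion, this is a rough sketch of how an 8-byte header with 1-byte enums could be laid out. Only the enum names and enumerators come from the comment above. The uint8_t underlying types, every field name, and the field order are assumptions made for illustration and may not match the actual header in this PR.

```cpp
#include <cassert>
#include <cstdint>

// Enum names and enumerators are from the PR discussion; the 1-byte
// underlying type is an assumption based on "use 1 byte".
enum class AlpMode : uint8_t { kAlp = 0 };
enum class AlpBitPackLayout : uint8_t { kNormal = 0 };

// Hypothetical field set, chosen only to show that 8 bytes is enough.
struct AlpHeaderSketch {
  AlpMode mode;
  AlpBitPackLayout layout;
  uint8_t exponent;         // assumed: ALP exponent e
  uint8_t factor;           // assumed: ALP factor f
  uint8_t bit_width;        // assumed: bits per packed integer
  uint8_t reserved;         // assumed: padding for future use
  uint16_t num_exceptions;  // assumed: count of values stored verbatim
};
static_assert(sizeof(AlpHeaderSketch) == 8, "sketch header should stay at 8 bytes");

// Serializes field by field so the byte layout does not depend on host
// endianness; the 16-bit count is written little-endian.
inline void WriteHeader(const AlpHeaderSketch& h, uint8_t* out) {
  assert(out != nullptr);
  out[0] = static_cast<uint8_t>(h.mode);
  out[1] = static_cast<uint8_t>(h.layout);
  out[2] = h.exponent;
  out[3] = h.factor;
  out[4] = h.bit_width;
  out[5] = h.reserved;
  out[6] = static_cast<uint8_t>(h.num_exceptions & 0xFF);
  out[7] = static_cast<uint8_t>(h.num_exceptions >> 8);
}

inline AlpHeaderSketch ReadHeader(const uint8_t* in) {
  assert(in != nullptr);
  AlpHeaderSketch h{};
  h.mode = static_cast<AlpMode>(in[0]);
  h.layout = static_cast<AlpBitPackLayout>(in[1]);
  h.exponent = in[2];
  h.factor = in[3];
  h.bit_width = in[4];
  h.reserved = in[5];
  h.num_exceptions = static_cast<uint16_t>(in[6] | (in[7] << 8));
  return h;
}
```

Values such as the element count or the physical type are deliberately absent from the sketch, on the assumption that they are already available from the Parquet page header.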