
[FEA] Support V2 encodings in Parquet reader and writer

Open GregoryKimball opened this issue 2 years ago • 1 comments

The Parquet V1 format supports three types of page encodings: PLAIN, DICTIONARY, and RLE (run-length encoding) (reference from Spark Jira). The newer, still-evolving Parquet V2 specification adds several encodings, including DELTA_BINARY_PACKED for the INT32 and INT64 types, and DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY for the string logical type.
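For readers unfamiliar with DELTA_BINARY_PACKED, here is a minimal pure-Python sketch of the idea behind it: store the first value, then per-block deltas with the block minimum subtracted so the remaining values are small and non-negative. This is only an illustration. The function names and the `block_size` parameter are hypothetical, the real format also bit-packes the adjusted deltas in miniblocks, and none of this reflects the libcudf implementation.

```python
def delta_encode(values, block_size=4):
    """Simplified delta encoding (no bit-packing, not the on-disk layout).

    Returns (first_value, blocks), where each block is
    (min_delta, [delta - min_delta, ...]).
    """
    if not values:
        return None, []
    # Differences between consecutive values.
    deltas = [b - a for a, b in zip(values, values[1:])]
    blocks = []
    for start in range(0, len(deltas), block_size):
        chunk = deltas[start:start + block_size]
        min_delta = min(chunk)
        # Subtracting the block minimum leaves small non-negative numbers,
        # which is what makes bit-packing effective in the real format.
        blocks.append((min_delta, [d - min_delta for d in chunk]))
    return values[0], blocks


def delta_decode(first_value, blocks):
    """Invert delta_encode by accumulating the restored deltas."""
    if first_value is None:
        return []
    out = [first_value]
    for min_delta, adjusted in blocks:
        for d in adjusted:
            out.append(out[-1] + d + min_delta)
    return out


timestamps = [100, 101, 103, 106, 110, 115]
first, blocks = delta_encode(timestamps)
assert delta_decode(first, blocks) == timestamps
```

Sorted or slowly varying integer columns (timestamps, row IDs) benefit most, since their adjusted deltas need very few bits each.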

In the Parquet reader and writer, libcudf should support V2 metadata as well as all three variants of DELTA encoding.
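The two string-oriented variants can be sketched in the same spirit. DELTA_LENGTH_BYTE_ARRAY stores all value lengths up front followed by the concatenated bytes, and DELTA_BYTE_ARRAY stores, for each value, the length of the prefix shared with the previous value plus the remaining suffix. The sketch below is a simplification with hypothetical function names: in the actual format the length lists are themselves DELTA_BINARY_PACKED and the suffixes are DELTA_LENGTH_BYTE_ARRAY encoded, which is skipped here.

```python
def length_byte_array_encode(strings):
    """Simplified DELTA_LENGTH_BYTE_ARRAY: (lengths, concatenated bytes)."""
    data = [s.encode("utf-8") for s in strings]
    return [len(b) for b in data], b"".join(data)


def length_byte_array_decode(lengths, blob):
    """Slice the concatenated bytes back into strings using the lengths."""
    out, pos = [], 0
    for n in lengths:
        out.append(blob[pos:pos + n].decode("utf-8"))
        pos += n
    return out


def byte_array_encode(strings):
    """Simplified DELTA_BYTE_ARRAY: (prefix_lengths, suffixes as bytes)."""
    prefix_lengths, suffixes = [], []
    prev = b""
    for s in strings:
        cur = s.encode("utf-8")
        # Count leading bytes shared with the previous value.
        n = 0
        limit = min(len(prev), len(cur))
        while n < limit and prev[n] == cur[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(cur[n:])
        prev = cur
    return prefix_lengths, suffixes


def byte_array_decode(prefix_lengths, suffixes):
    """Rebuild each value from the previous value's prefix plus its suffix."""
    out, prev = [], b""
    for n, suffix in zip(prefix_lengths, suffixes):
        cur = prev[:n] + suffix
        out.append(cur.decode("utf-8"))
        prev = cur
    return out


words = ["apple", "applesauce", "apply", "banana"]
assert byte_array_decode(*byte_array_encode(words)) == words
assert length_byte_array_decode(*length_byte_array_encode(words)) == words
```

Prefix sharing pays off mainly on sorted string data, while the length-plus-bytes layout does not depend on ordering, which is consistent with the table below listing DELTA_LENGTH_BYTE_ARRAY for unsorted data.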

| Feature | Status | Notes |
| --- | --- | --- |
| Add V2 reader support | ✅ #11778 | |
| Multi-warp decode of Dremel data streams | ✅ #13203 | |
| Use efficient strings column factory in decoder | ✅ #13302 | |
| Implement DELTA_BINARY_PACKED decoding | ✅ #13637 | see #12948 for reference |
| Implement DELTA_BYTE_ARRAY decoding | ✅ #14101 | see #12948 for reference |
| Add V2 writer support | ✅ #13751 | |
| Implement DELTA_BINARY_PACKED encoding | ✅ #14100 | |
| Add python bindings for V2 header and options | ✅ #14316 | |
| Implement DELTA_BYTE_ARRAY encoding | ✅ #15239 | some outdated reviews in #14938 |
| Implement DELTA_LENGTH_BYTE_ARRAY encoding and decoding for unsorted data | ✅ #14590 | |
| Add C++ API support for specifying encodings | ✅ #15081 | |
| Add cuDF-python API support for specifying encodings | | |
| Add BYTE_STREAM_SPLIT encoding and decoding | ✅ #15311 | see issue #15226 and parquet reference |

GregoryKimball, Jun 02 '23 03:06

@GregoryKimball should there be an entry for adding python bindings for the V2 options?

etseidl, Sep 14 '23 00:09

Congratulations @etseidl! Everyone, please stay tuned for a technical blog on this topic! 😄

GregoryKimball, Jun 11 '24 20:06