[FEA] Support V2 encodings in Parquet reader and writer
The Parquet V1 format supports three page encodings: PLAIN, DICTIONARY, and RLE (run-length encoding), as noted in a Spark Jira discussion. The newer, still-evolving Parquet V2 specification adds several encodings, including DELTA_BINARY_PACKED for INT32 and INT64 types, and DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY for the string logical type.
In the Parquet reader and writer, libcudf should support V2 metadata as well as the three DELTA encoding variants.
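To make the goal concrete, here is a minimal, simplified sketch of the idea behind DELTA_BINARY_PACKED. This is not the libcudf implementation and it omits the real format's block/miniblock layout, zigzag varints, and bit packing; it only computes the pieces (first value, minimum delta, adjusted deltas, bit width) that make the encoding compact for sorted or slowly varying integers.

```python
def delta_binary_packed_sketch(values):
    """Simplified single-block view of DELTA_BINARY_PACKED.

    The real format stores a header (block size, miniblock count, total
    value count, first value) followed by per-miniblock bit widths and
    bit-packed (delta - min_delta) values. Here we just derive those
    quantities to show why small bit widths fall out of typical data.
    """
    first = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas) if deltas else 0
    adjusted = [d - min_delta for d in deltas]  # all non-negative
    bit_width = max(adjusted).bit_length() if adjusted else 0
    return first, min_delta, bit_width, adjusted


def delta_binary_packed_decode(first, min_delta, adjusted):
    """Invert the sketch: rebuild the values by prefix-summing deltas."""
    out = [first]
    for a in adjusted:
        out.append(out[-1] + min_delta + a)
    return out
```

For example, the sorted column `[100, 101, 102, 105]` reduces to deltas `[1, 1, 3]`, a minimum delta of 1, and adjusted deltas `[0, 0, 2]` that need only 2 bits each, which is the source of the encoding's space savings.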
| Feature | Status | Notes |
|---|---|---|
| Add V2 reader support | ✅ #11778 | |
| Multi-warp decode of Dremel data streams | ✅ #13203 | |
| Use efficient strings column factory in decoder | ✅ #13302 | |
| Implement DELTA_BINARY_PACKED decoding | ✅ #13637 | see #12948 for reference |
| Implement DELTA_BYTE_ARRAY decoding | ✅ #14101 | see #12948 for reference |
| Add V2 writer support | ✅ #13751 | |
| Implement DELTA_BINARY_PACKED encoding | ✅ #14100 | |
| Add python bindings for V2 header and options | ✅ #14316 | |
| Implement DELTA_BYTE_ARRAY encoding | ✅ #15239 | some outdated reviews in #14938 |
| Implement DELTA_LENGTH_BYTE_ARRAY encoding and decoding for unsorted data | ✅ #14590 | |
| Add C++ API support for specifying encodings | ✅ #15081 | |
| Add cuDF-python API support for specifying encodings | | |
| Add BYTE_STREAM_SPLIT encoding and decoding | ✅ #15311 | see issue #15226 and parquet reference |
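For the string encodings tracked above, a hedged sketch of DELTA_LENGTH_BYTE_ARRAY may help: the format stores all string lengths up front (DELTA_BINARY_PACKED in the real spec), followed by the concatenated string bytes with no per-value length prefixes. The code below illustrates only that split; it is not libcudf's implementation and skips the delta encoding of the lengths themselves.

```python
def delta_length_byte_array_sketch(strings):
    """Split a string column into (lengths, concatenated bytes).

    In the actual Parquet format the lengths are themselves
    DELTA_BINARY_PACKED; here they are left as a plain list to keep
    the sketch short.
    """
    data = [s.encode("utf-8") for s in strings]
    lengths = [len(b) for b in data]
    return lengths, b"".join(data)


def delta_length_byte_array_decode(lengths, payload):
    """Rebuild the strings by slicing the payload at each length."""
    out, pos = [], 0
    for n in lengths:
        out.append(payload[pos:pos + n].decode("utf-8"))
        pos += n
    return out
```

Grouping the lengths together is what makes this layout friendly to GPU decoders: the offsets column can be built with a single prefix sum over the lengths, then all character data copied in one pass.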
@GregoryKimball should there be an entry for adding python bindings for the V2 options?
Congratulations @etseidl! Everyone, please stay tuned for a technical blog on this topic! 😄