Expose some Parquet per-column configuration options via the python API
Description
Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was suggested that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a', and a list<int32> column 'b', the fully qualified column names would be 'a' and 'b.list.element'.
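A minimal sketch of the intended usage (the dict-keyed-by-fully-qualified-name shape and the keyword spellings `column_encoding` and `skip_compression` are assumptions for illustration; the exact API is defined in this PR):

```python
import cudf

df = cudf.DataFrame({"a": [1, 2, 3], "b": [[1, 2], [3], [4, 5]]})

# Per-column options are keyed by fully qualified Parquet column names,
# so the leaf of the list column 'b' is addressed as 'b.list.element'.
df.to_parquet(
    "out.parquet",
    column_encoding={
        "a": "DELTA_BINARY_PACKED",   # integer column
        "b.list.element": "PLAIN",    # leaf of the list column
    },
    skip_compression={"b.list.element"},  # leave this column uncompressed
)
```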
Addresses "Add cuDF-python API support for specifying encodings" task in #13501.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
@GregoryKimball does this do what you wanted?
/ok to test
I think we would benefit from some tests using the `write_parquet` function and the `ParquetWriter` to write data, and then using the pyarrow parquet reader to read it back and verify correct encodings for each column.
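For reference, a sketch of that kind of read-back check with pyarrow (hypothetical file and column positions; `encodings` and `compression` are properties of pyarrow's `ColumnChunkMetaData`):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("out.parquet").metadata
col = meta.row_group(0).column(0)  # metadata for the first column chunk

# Verify the requested encoding was actually written...
assert "DELTA_BINARY_PACKED" in col.encodings
# ...and that compression was applied (or skipped) as requested.
print(col.compression)  # e.g. 'SNAPPY' or 'UNCOMPRESSED'
```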
@mhaseeb123 I added some parameterization to at least verify that all the valid Parquet encodings work on the write side. There is already testing elsewhere to verify encoding interoperability with pyarrow/pandas.[1] I also added a check that the column that should still be compressed actually is.
The option set for `ParquetWriter` has fallen far behind the `DataFrame.to_parquet()` path. Bringing the former up to date is beyond the scope of this PR, IMHO.
[1] With the exception of BYTE_STREAM_SPLIT, which was only recently fully implemented in Arrow 16. This too can be addressed in a follow-up PR.
/ok to test
@galipremsagar Would you please check how `cudf.pandas` runs when `column_encoding` is specified? The user command would be something like `df.to_parquet(column_encoding='DELTA_BYTE_ARRAY')`.
By the way, what would `cudf.pandas` do if the user specified an `engine`, such as `df.to_parquet(engine='pyarrow', column_encoding='DELTA_BYTE_ARRAY')`?
> @galipremsagar Would you please check how `cudf.pandas` runs when `column_encoding` is specified? The user command would be something like `df.to_parquet(column_encoding='DELTA_BYTE_ARRAY')`.
This will invoke the cuDF parquet writer.
> By the way, what would `cudf.pandas` do if the user specified an `engine`, such as `df.to_parquet(engine='pyarrow', column_encoding='DELTA_BYTE_ARRAY')`?
This will invoke pyarrow's parquet writer through cuDF code, but ignore `column_encoding`.
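To illustrate the two dispatch paths described above (a sketch only; it assumes the script runs under the `cudf.pandas` accelerator and that `column_encoding` accepts a single encoding name applied to all columns):

```python
# Run with: python -m cudf.pandas script.py
import pandas as pd

df = pd.DataFrame({"s": ["abc", "abd", "abe"]})

# Dispatched to the cuDF parquet writer; the encoding request is honored.
df.to_parquet("gpu.parquet", column_encoding="DELTA_BYTE_ARRAY")

# Dispatched to pyarrow's writer via cuDF; column_encoding is ignored.
df.to_parquet("cpu.parquet", engine="pyarrow", column_encoding="DELTA_BYTE_ARRAY")
```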
/okay to test
/merge