
Expose some Parquet per-column configuration options via the python API

etseidl opened this pull request 1 year ago • 4 comments

Description

Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was suggested that these options be exposed in cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a' and a list<int32> column 'b', the fully qualified column names would be 'a' and 'b.list.element'.
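A hypothetical usage sketch is below; the keyword names (column_encoding, skip_compression) and accepted value shapes are assumptions based on this description, so the merged API docs are authoritative:

```python
import cudf

df = cudf.DataFrame({"a": [1, 2, 3], "b": [[1, 2], [3], [4, 5, 6]]})

df.to_parquet(
    "out.parquet",
    # Per-column page encodings, keyed by fully qualified Parquet column names.
    column_encoding={"a": "DELTA_BINARY_PACKED", "b.list.element": "PLAIN"},
    # Leave column 'a' uncompressed while the file-level codec still applies
    # to everything else.
    skip_compression={"a"},
)
```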

Addresses "Add cuDF-python API support for specifying encodings" task in #13501.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

etseidl avatar Apr 29 '24 22:04 etseidl

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Apr 29 '24 22:04 copy-pr-bot[bot]

@GregoryKimball does this do what you wanted?

etseidl avatar Apr 29 '24 22:04 etseidl

/ok to test

vuule avatar May 08 '24 20:05 vuule

/ok to test

vuule avatar May 09 '24 19:05 vuule

I think we would benefit from some tests that use the write_parquet function and the ParquetWriter to write data, then use the pyarrow Parquet reader to read it back and verify the correct encodings for each column.
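One possible shape for such a check, as a rough sketch (the column_encoding keyword is the one added by this PR; the metadata inspection is standard pyarrow API):

```python
import cudf
import pyarrow.parquet as pq

# Write a column with an explicit encoding request via cuDF.
df = cudf.DataFrame({"a": list(range(1000))})
df.to_parquet("encodings.parquet", column_encoding={"a": "DELTA_BINARY_PACKED"})

# Read the metadata back with pyarrow and verify the encodings recorded
# for the column chunk.
col_meta = pq.ParquetFile("encodings.parquet").metadata.row_group(0).column(0)
assert "DELTA_BINARY_PACKED" in col_meta.encodings
```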

@mhaseeb123 I added some parameterization to at least verify that all the valid Parquet encodings work on the write side. There is already testing elsewhere that verifies encoding interoperability with pyarrow/pandas.[1] I also added a check that the column that should remain compressed actually is.
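For reference, a sketch of how that kind of compression check can be eyeballed with pyarrow metadata, assuming the skip_compression keyword from this PR:

```python
import cudf
import pyarrow.parquet as pq

df = cudf.DataFrame({"a": ["x" * 100] * 1000, "b": ["y" * 100] * 1000})
# Skip compression for 'a' only; 'b' keeps the default codec.
df.to_parquet("mixed.parquet", skip_compression={"a"})

rg = pq.ParquetFile("mixed.parquet").metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    # The skipped column should report equal compressed/uncompressed sizes,
    # while the other column should still shrink under the default codec.
    print(col.path_in_schema, col.compression,
          col.total_compressed_size, col.total_uncompressed_size)
```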

The option set for ParquetWriter has fallen far behind the DataFrame.to_parquet() path. Bringing the former up to date is beyond the scope of this PR, IMHO.

[1] With the exception of BYTE_STREAM_SPLIT, which was only recently fully implemented in Arrow 16. This too can be addressed in a follow-up PR.

etseidl avatar May 10 '24 23:05 etseidl

/ok to test

mhaseeb123 avatar May 14 '24 02:05 mhaseeb123

/ok to test

vuule avatar May 16 '24 16:05 vuule

@galipremsagar Would you please check how cudf.pandas runs when column_encoding is specified?

The user command would be something like df.to_parquet(column_encoding='DELTA_BYTE_ARRAY').

By the way, what would cudf.pandas do if the user specified an engine, such as: df.to_parquet(engine='pyarrow', column_encoding='DELTA_BYTE_ARRAY')?

GregoryKimball avatar May 20 '24 18:05 GregoryKimball

@galipremsagar Would you please check how cudf.pandas runs when column_encoding is specified?

The user command would be something like df.to_parquet(column_encoding='DELTA_BYTE_ARRAY').

This will invoke the cuDF Parquet writer.

By the way, what would cudf.pandas do if the user specified an engine, such as: df.to_parquet(engine='pyarrow', column_encoding='DELTA_BYTE_ARRAY')?

This will invoke pyarrow's Parquet writer through cuDF code, but column_encoding will be ignored.
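A minimal sketch of the two paths described above, assuming cudf.pandas is active (e.g. python -m cudf.pandas script.py) and the keyword is spelled column_encoding:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# No engine specified: cudf.pandas dispatches to the cuDF Parquet writer,
# which honors the per-column encoding request.
df.to_parquet("gpu.parquet", column_encoding={"a": "DELTA_BINARY_PACKED"})

# engine='pyarrow': the call goes through cuDF code to pyarrow's writer;
# per the discussion above, column_encoding is ignored on this path.
df.to_parquet("cpu.parquet", engine="pyarrow",
              column_encoding={"a": "DELTA_BINARY_PACKED"})
```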

galipremsagar avatar May 22 '24 14:05 galipremsagar

/okay to test

galipremsagar avatar May 22 '24 17:05 galipremsagar

/merge

vuule avatar May 22 '24 20:05 vuule