cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Add ability to request Parquet encodings on a per-column basis

Open etseidl opened this issue 1 year ago • 1 comments

Description

Allows users to request specific page encodings to use on a column-by-column basis. This is accomplished by adding an encoding property to the column_input_metadata struct. This is a necessary change before adding DELTA_BYTE_ARRAY encoding.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

etseidl avatar Feb 16 '24 23:02 etseidl

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Feb 16 '24 23:02 copy-pr-bot[bot]

/ok to test

vuule avatar Mar 01 '24 02:03 vuule

/ok to test

PointKernel avatar Mar 04 '24 17:03 PointKernel

@vuule @PointKernel I caught a last minute bug. I now print a warning if dict encoding is requested but cannot be used, and modified some logic in gpuInitPages. Do you all think the warning is necessary? And should I add a test for this case?

etseidl avatar Mar 04 '24 23:03 etseidl

/ok to test

PointKernel avatar Mar 05 '24 00:03 PointKernel

Do you all think the warning is necessary? And should I add a test for this case?

Yes, it's good to have a warning and a test exercising this. I assume the test doesn't require significant effort?

PointKernel avatar Mar 05 '24 00:03 PointKernel

Do you all think the warning is necessary? And should I add a test for this case?

Yes, it's good to have a warning and a test exercising this. I assume the test doesn't require significant effort?

No, it will be a pretty quick add.

etseidl avatar Mar 05 '24 00:03 etseidl

Test checked in...but last CI run failed style check because of some python?

etseidl avatar Mar 05 '24 00:03 etseidl

Test checked in...but last CI run failed style check because of some python?

Yeah, that's a new one. I think there was a two-hour window when CI passed :skull:

vuule avatar Mar 05 '24 01:03 vuule

/ok to test

vuule avatar Mar 05 '24 03:03 vuule

/ok to test

PointKernel avatar Mar 05 '24 20:03 PointKernel

Perhaps not in this PR, but we will ultimately want to expose this in cuDF-python. In pandas, to_parquet with the pyarrow engine supports the same column_encoding parameter.

GregoryKimball avatar Mar 05 '24 22:03 GregoryKimball

Perhaps not in this PR, but we will ultimately want to expose this in cuDF-python. pyarrow supports the same column_encoding parameter.

That was my original plan, but that's a heavier lift and I just wanted the bare minimum to at least be able to test new encoders. The pain point with how pyarrow does it is knowing in advance what the column names will be, esp for nested (see #14539 for instance).

etseidl avatar Mar 05 '24 22:03 etseidl

/ok to test

vuule avatar Mar 05 '24 22:03 vuule

/merge

PointKernel avatar Mar 06 '24 06:03 PointKernel