cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Align integral types in ORC to specs

Open vuule opened this issue 1 year ago • 2 comments

Description

Use uint64_t where specified by the ORC specs:

  • PostScript::compressionBlockSize
  • StripeInformation::footerLength
  • StripeInformation::numberOfRows

Using the same type for the derived values.

Other changes:

  • Changed the num_rows in orc_metadata to uint64_t so it works with files that have more than 2B rows.
  • Modified how the skiprows parameter in Python is converted to a C++ value, so now we can skip more than 2B rows.
  • Renamed FileFooter to Footer to match the specs.

No measurable impact on performance or on the memory footprint of the ORC reader.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [ ] The documentation is up to date with these changes.

vuule avatar Feb 08 '24 03:02 vuule

Please also take a look at stripe_init.cu to see if there is anything we need to modify there.

ttnghia avatar Feb 09 '24 05:02 ttnghia

Please also take a look at stripe_init.cu to see if there is anything we need to modify there.

Went over that code and made changes in functions related to the reader. I believe there are some places where specs say 64bit, but we wouldn't be able to parse such files (e.g. compression block size). I think we can just throw when block size is greater than 2B :D

vuule avatar Feb 10 '24 00:02 vuule

A small question: do we intend to exercise the 2B element handling in the unit tests?

Added a test that takes 2.5 seconds on my system. I think that's acceptable.

vuule avatar Feb 23 '24 21:02 vuule

/merge

vuule avatar Feb 24 '24 03:02 vuule