cudf Align integral types in ORC to specs

Align integral types in ORC to specs

Open vuule opened this issue 1 year ago • 2 comments

Description

Use uint64_t where specified by the ORC specs:

PostScript::compressionBlockSize
StripeInformation::footerLength
StripeInformation::numberOfRows

Using the same type for the derived values.

Other changes:

Changed the num_rows in orc_metadata to uint64_t so it works with files that have more than 2B rows.
Modified how the skiprows parameter in Python is converted to a C++ value, so now we can skip more than 2B rows.
Renamed FileFooter to Footer to match the specs.

No measurable impact on performance or on the memory footprint of the ORC reader.

Checklist

[x] I am familiar with the Contributing Guidelines.
[x] New or existing tests cover these changes.
[ ] The documentation is up to date with these changes.

Feb 08 '24 03:02 vuule

Please also take a look at stripe_init.cu to see if there is anything we need to modify there.

Feb 09 '24 05:02 ttnghia

Please also take a look at stripe_init.cu to see if there is anything we need to modify there.

Went over that code and made changes in functions related to the reader. I believe there are some places where specs say 64bit, but we wouldn't be able to parse such files (e.g. compression block size). I think we can just throw when block size is greater than 2B :D

Feb 10 '24 00:02 vuule

A small question: do we intend to exercise the 2B element handling in the unit tests?

Added a test that takes 2.5 seconds on my system. I think that's acceptable.

Feb 23 '24 21:02 vuule

/merge

Feb 24 '24 03:02 vuule

cudf cudf copied to clipboard

Align integral types in ORC to specs

Description

Checklist

cudf
cudf copied to clipboard