cudf
cudf copied to clipboard
Align integral types in ORC to specs
Description
Use uint64_t
where specified by the ORC specs:
-
PostScript::compressionBlockSize
-
StripeInformation::footerLength
-
StripeInformation::numberOfRows
Using the same type for the derived values.
Other changes:
- Changed the num_rows in orc_metadata to uint64_t so it works with files that have more than 2B rows.
- Modified how the skiprows parameter in Python is converted to a C++ value, so now we can skip more than 2B rows.
- Renamed
FileFooter
toFooter
to match the specs.
No measurable impact on performance or on the memory footprint of the ORC reader.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
Please also take a look at stripe_init.cu
to see if there is anything we need to modify there.
Please also take a look at
stripe_init.cu
to see if there is anything we need to modify there.
Went over that code and made changes in functions related to the reader. I believe there are some places where specs say 64bit, but we wouldn't be able to parse such files (e.g. compression block size). I think we can just throw when block size is greater than 2B :D
A small question: do we intend to exercise the 2B element handling in the unit tests?
Added a test that takes 2.5 seconds on my system. I think that's acceptable.
/merge