[FEA] Add Parquet and ORC unit tests based on Apache sample files
During the 23.06 release, we encountered several important Parquet and ORC writer issues that risked data corruption. These issues included:
- Rare failure with the page size estimator (Parquet writer, Report, Fix)
- Failure with tables larger than 1 GB (Parquet writer, Report, Fix)
- Failure with 10k nulls followed by more than 5 valid values (ORC writer, Report, Fix)
After discussion with the team, we agreed on the following additions to our test suite to help prevent similar issues in the future:
- For each test file in parquet-testing/data, verify that "read" and "read-write-read" produce identical tables
- For each test file in orc/examples, verify that "read" and "read-write-read" produce identical tables
- For each test file in parquet-testing/data, verify that "read" and "read_with_Arrow-convert_to_cudf" produce identical tables
- For each test file in orc/examples, verify that "read" and "read_with_Arrow-convert_to_cudf" produce identical tables
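
The two kinds of checks listed above could be sketched roughly as follows. This is only an illustration, not the proposed test implementation: the helper names `roundtrip_matches` and `readers_agree` are hypothetical, and the reader, writer, and equality functions are passed in as callables so that the same helpers could serve both Parquet and ORC.

```python
import os
import tempfile
from typing import Any, Callable


def roundtrip_matches(
    path: str,
    read: Callable[[str], Any],
    write: Callable[[Any, str], None],
    equals: Callable[[Any, Any], bool],
) -> bool:
    """Compare "read" against "read-write-read" for one sample file.

    Hypothetical helper: `read`, `write`, and `equals` are supplied by the
    caller, so nothing here assumes a particular library.
    """
    original = read(path)
    with tempfile.TemporaryDirectory() as tmp:
        # Keep the source file's extension for the rewritten copy.
        out = os.path.join(tmp, "roundtrip" + os.path.splitext(path)[1])
        write(original, out)
        roundtripped = read(out)
    return equals(original, roundtripped)


def readers_agree(
    path: str,
    read_a: Callable[[str], Any],
    read_b: Callable[[str], Any],
    equals: Callable[[Any, Any], bool],
) -> bool:
    """Compare two independent readers (e.g. cudf vs. Arrow) on one file."""
    return equals(read_a(path), read_b(path))
```

With cudf, one plausible wiring for Parquet (an assumption, not something this issue prescribes) would be `read=cudf.read_parquet`, `write=lambda df, p: df.to_parquet(p)`, and `equals=lambda a, b: a.equals(b)`. For the Arrow comparison, `read_b` could be `lambda p: cudf.DataFrame.from_arrow(pyarrow.parquet.read_table(p))`. The ORC variants would swap in `cudf.read_orc`, `DataFrame.to_orc`, and `pyarrow.orc.read_table`. Whether `DataFrame.equals` is strict enough about nulls and dtypes for this purpose would need to be decided by the team.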
Note: please also see #12739. For reader benchmarks, verify that the roundtripped table matches the starting table.