[FEA] Add Parquet and ORC unit tests based on Apache sample files

Open GregoryKimball opened this issue 2 years ago • 0 comments

During the 23.06 release, we encountered several important Parquet and ORC writer issues that risked data corruption. These issues included:

  • Rare failure with page size estimator (PQ writer, Report, Fix)
  • Failure with >1GB tables (PQ writer, Report, Fix)
  • Failure with 10k nulls followed by >5 valid values (ORC writer, Report, Fix)

After discussion with the team, we agreed on these additions to our testing suite to help prevent similar issues in the future:

  • Based on test files in parquet-testing/data, verify that "read" versus "read-write-read" result in identical tables
  • Based on test files in orc/examples, verify that "read" versus "read-write-read" result in identical tables
  • Based on test files in parquet-testing/data, verify that "read" versus "read_with_Arrow-convert_to_cudf" result in identical tables
  • Based on test files in orc/examples, verify that "read" versus "read_with_Arrow-convert_to_cudf" result in identical tables
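The two kinds of check listed above could be sketched as a small, library-agnostic harness. This is only an illustration, not the planned test implementation: the helper names `roundtrip_matches`, `readers_agree`, and `failing_samples` are hypothetical, and the reader, writer, and equality functions are passed in as parameters so the sketch does not assume any particular cudf or Arrow signature.

```python
import operator
import tempfile
from pathlib import Path
from typing import Any, Callable

Reader = Callable[[Path], Any]
Writer = Callable[[Any, Path], Any]
Equal = Callable[[Any, Any], bool]


def roundtrip_matches(sample: Path, read: Reader, write: Writer,
                      equal: Equal = operator.eq) -> bool:
    """Compare "read" against "read-write-read" for one sample file."""
    expected = read(sample)                    # plain "read"
    with tempfile.TemporaryDirectory() as tmp:
        rewritten = Path(tmp) / sample.name
        write(expected, rewritten)             # "write" with the writer under test
        actual = read(rewritten)               # second "read"
    return equal(expected, actual)


def readers_agree(sample: Path, read: Reader, reference_read: Reader,
                  equal: Equal = operator.eq) -> bool:
    """Compare the reader under test against a reference reader."""
    return equal(read(sample), reference_read(sample))


def failing_samples(data_dir: Path, pattern: str, read: Reader, write: Writer,
                    equal: Equal = operator.eq) -> list[str]:
    """Return the names of sample files whose roundtrip does not match."""
    return sorted(
        path.name
        for path in data_dir.glob(pattern)
        if not roundtrip_matches(path, read, write, equal)
    )
```

With cudf, one would presumably pass `cudf.read_parquet` as `read`, a wrapper such as `lambda df, p: df.to_parquet(p)` as `write`, and `lambda a, b: a.equals(b)` as `equal`. For the Arrow comparison, `reference_read` could wrap `pyarrow.parquet.read_table` followed by `cudf.DataFrame.from_arrow`. Those pairings are assumptions for illustration and would need checking against the sample files, some of which may need to be skipped if they use features a reader does not support.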

Note: please also see #12739. For reader benchmarks, verify that the round-tripped table matches the starting table.

GregoryKimball • Jun 27 '23 19:06