
add parquet test data

Open yaqi-zhao opened this issue 3 years ago • 3 comments

yaqi-zhao avatar Dec 08 '22 01:12 yaqi-zhao

Hi @yaqi-zhao,

  1. Can you clarify the PR title and description to explain what this is about?
  2. Can you fill in information about the data files in https://github.com/apache/parquet-testing/blob/master/data/README.md?

pitrou avatar Dec 13 '22 14:12 pitrou

Hi @pitrou, I submitted a PR to Apache Arrow (https://github.com/apache/arrow/pull/14585) that adds a benchmark which will use these files. The benchmark is intended to analyze Parquet reader performance with different bit-width packing.

yaqi-zhao avatar Dec 14 '22 06:12 yaqi-zhao

How long does it take to generate those files on the fly from the benchmarks?

In general parquet-testing is for interoperability testing between different Parquet implementations, not for benchmarking of individual implementations.

At worst we could use arrow-testing for that, but even then we should strive to make the files much smaller. We don't want to consume hundreds of MB just for a single set of benchmarks, IMHO.

pitrou avatar Dec 14 '22 09:12 pitrou
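
For anyone weighing the on-the-fly option raised above, the cost of producing the input values can be estimated before any Parquet file is written. The following is only a minimal sketch using the Python standard library: `make_column` and `time_generation` are hypothetical helper names, not part of Arrow or parquet-testing, and the step of actually writing the values to a Parquet file is deliberately left out. It assumes the benchmark only needs integer columns whose values fit in a chosen bit width.

```python
import random
import time


def make_column(bit_width: int, num_values: int, seed: int = 0) -> list[int]:
    """Return num_values pseudo-random ints that each fit in bit_width bits."""
    if bit_width < 1:
        raise ValueError("bit_width must be at least 1")
    rng = random.Random(seed)  # fixed seed keeps benchmark inputs reproducible
    return [rng.getrandbits(bit_width) for _ in range(num_values)]


def time_generation(bit_widths, num_values):
    """Map each bit width to the seconds spent generating its column."""
    timings = {}
    for width in bit_widths:
        start = time.perf_counter()
        make_column(width, num_values)
        timings[width] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Hypothetical sizes; adjust to whatever the benchmark actually needs.
    for width, seconds in time_generation(range(1, 9), 100_000).items():
        print(f"bit width {width}: {seconds:.4f} s")
```

If generation turns out to be cheap relative to the read being measured, the data could be built in the benchmark setup instead of being checked in as files.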