add parquet test data
Hi @yaqi-zhao,
- Can you clarify the PR title and description to explain what this is about?
- Can you fill in information about the data files in https://github.com/apache/parquet-testing/blob/master/data/README.md?
Hi @pitrou, I submitted a PR to Apache Arrow (https://github.com/apache/arrow/pull/14585) that adds a benchmark which uses these files. The benchmark is intended to analyze Parquet reader performance with different bit-width packing.
How long does it take to generate those files on the fly from the benchmarks?
In general parquet-testing is for interoperability testing between different Parquet implementations, not for benchmarking of individual implementations.
At worst we could use arrow-testing for that, but even then we should strive to make the files much smaller. We don't want to consume hundreds of MB just for a single set of benchmarks, IMHO.
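To make the on-the-fly question easier to answer, here is a rough sketch of how the input values for each bit width could be generated and timed in memory. It uses only the Python standard library and does not write Parquet. The names `generate_values` and `time_generation` are hypothetical and are not taken from Arrow or from the linked PR, and the real benchmark would presumably do the equivalent in C++.

```python
import random
import time


def generate_values(bit_width, count, seed=0):
    """Return `count` non-negative ints that each fit in `bit_width` bits.

    Hypothetical helper for illustration only.
    """
    if bit_width < 1 or count < 1:
        raise ValueError("bit_width and count must both be at least 1")
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    values = [rng.getrandbits(bit_width) for _ in range(count)]
    # Pin one value to the maximum so the column needs exactly `bit_width` bits.
    values[0] = (1 << bit_width) - 1
    return values


def time_generation(bit_widths, count):
    """Measure how long generating `count` values takes for each bit width."""
    timings = {}
    for bit_width in bit_widths:
        start = time.perf_counter()
        generate_values(bit_width, count)
        timings[bit_width] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    for bit_width, seconds in time_generation(range(1, 9), 100_000).items():
        print(f"bit width {bit_width}: {seconds:.4f} s")
```

If generation turns out to be cheap compared with the read being measured, the benchmark could build its inputs at setup time and no data files would need to be checked in.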