tpch Generated Parquet files are extremely fragmented

Open jaychia opened this issue 2 months ago • 8 comments

Hi, I noticed that the generated Parquet files are extremely fragmented in terms of rowgroups. This likely points to a bug in the Polars Parquet writer, and it definitely also affects the results of the benchmarks.

For a SCALE_FACTOR=10 table generation, the Parquet files have a staggering 20,000 rowgroups!

Each rowgroup only has about 3,400 rows and a size of 117kB. For reference, Parquet rowgroups are often suggested to be in the range of about 128MB. Because we have so many rowgroups, the Parquet metadata itself is 27MB and it likely introduces a ton of hops in the process of reading the file 😅
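For anyone who wants to check their own generated files, here is a minimal sketch of how the rowgroup counts and sizes could be read from the Parquet footer with PyArrow. The helper names `row_group_stats` and `summarize` are made up for illustration and are not part of the benchmark repo. Note that `total_byte_size` reports uncompressed column data, so it may not match on-disk sizes exactly.

```python
from __future__ import annotations


def row_group_stats(path: str) -> list[tuple[int, int]]:
    """Return (num_rows, total_byte_size) for every rowgroup in a Parquet file."""
    # Imported lazily so summarize() below can be used without pyarrow installed.
    import pyarrow.parquet as pq

    meta = pq.ParquetFile(path).metadata
    return [
        (meta.row_group(i).num_rows, meta.row_group(i).total_byte_size)
        for i in range(meta.num_row_groups)
    ]


def summarize(stats: list[tuple[int, int]]) -> dict[str, float]:
    """Aggregate per-rowgroup (rows, bytes) pairs into a few headline numbers."""
    n = len(stats)
    total_rows = sum(rows for rows, _ in stats)
    total_bytes = sum(size for _, size in stats)
    return {
        "num_row_groups": n,
        "total_rows": total_rows,
        "avg_rows_per_group": total_rows / n if n else 0.0,
        "avg_bytes_per_group": total_bytes / n if n else 0.0,
    }
```

Calling `summarize(row_group_stats(path))` on one of the generated tables (the path is whatever your local output location is) should make the fragmentation visible at a glance.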

Writing this instead with PyArrow (I amended the code in prepare_data.py), we get much more well-behaved rowgroups:

Still fairly small as rowgroups go, but I think it's much more reasonable and represents Parquet data in the wild a little better!
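As a rough sketch of the PyArrow route (this is not the exact change made to prepare_data.py, and the function names are hypothetical), one could convert the Polars DataFrame to an Arrow table and pass an explicit `row_group_size` to `pyarrow.parquet.write_table`. That parameter is a maximum row count, not a byte size, so the helper below estimates a row count from a byte target and an assumed average row size.

```python
from __future__ import annotations


def rows_per_group(target_bytes: int, avg_row_bytes: float) -> int:
    """Estimate how many rows fit in a rowgroup of roughly target_bytes."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return max(1, int(target_bytes / avg_row_bytes))


def write_with_pyarrow(df, path: str, row_group_size: int) -> None:
    """Write a Polars DataFrame through PyArrow with an explicit rowgroup cap."""
    # Imported lazily; requires pyarrow (and a Polars DataFrame as df).
    import pyarrow.parquet as pq

    table = df.to_arrow()  # Polars DataFrame -> pyarrow.Table
    # row_group_size is the maximum number of rows per written rowgroup.
    pq.write_table(table, path, row_group_size=row_group_size)
```

Using the figures above (about 117kB per 3,400 rows) as the assumed row size, `rows_per_group(128 * 1024**2, 117_000 / 3_400)` comes out at a few million rows per rowgroup, which could then be passed as `row_group_size`. Treat that as a ballpark only, since compressed and uncompressed sizes differ.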

May 03 '24 06:05 jaychia
