
Optimize parquet output for remote reading

Open · H-Plus-Time opened this issue 6 months ago · 2 comments

TLDR: the primary pain point here is huge row groups (in terms of total uncompressed byte size) - writing the PageIndex, reducing row group sizes, or ideally both, would help a lot.

Basically, the defaults in pyarrow (and most parquet implementations) for row group sizes (1 million rows per row group) are predicated on assumptions about what a typical parquet file looks like: lots of numerics, booleans, and relatively short strings amenable to dictionary, RLE and delta encoding. Wide text datasets are very much not typical, and the default row group size gets you ~2GB per row group (and nearly 4GB uncompressed, just for the text column).

The simplest thing to do would be to default the row_group_size parameter to 100k - more or less the inflection point of this benchmark by DuckDB (the size overhead is about 0.5%). See the sketch below for what that looks like at the pyarrow level.
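
A minimal sketch of the writer-side change, using plain pyarrow rather than datatrove's own ParquetWriter (the table contents and output path are purely illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative wide-text table; a real datatrove output has more columns.
table = pa.table({
    "id": ["doc-0", "doc-1"],
    "text": ["a very long document body ...", "another long document ..."],
})

# Cap rows per row group at 100k instead of pyarrow's ~1M default, so a
# row group full of long text stays well below multi-GB sizes and remote
# readers can skip irrelevant row groups cheaply.
pq.write_table(table, "out.parquet", row_group_size=100_000)
```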

Setting write_page_index to true should help a great deal (arguably much more than smaller row groups), since readers can use the PageIndex to narrow reads down to individual data pages - it's not unusual for a point lookup to touch only ~0.1% of a file.
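
Again a hedged sketch rather than datatrove's actual code: write_page_index is a pyarrow keyword (available in recent pyarrow releases) that emits the ColumnIndex/OffsetIndex structures readers use for page-level pruning:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": ["doc-0"], "text": ["a very long document body ..."]})

pq.write_table(
    table,
    "out.parquet",
    row_group_size=100_000,   # smaller row groups, as above
    write_page_index=True,    # write per-page min/max stats + offset index
)
```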
