InternalParquetRecordWriter doesn't use min/max row counts
PARQUET-99 added settings to control the minimum and maximum number of rows between size checks when flushing pages, and a setting to control whether to always use a static row count (the minimum). The InternalParquetRecordWriter has similar checks that don't use those settings. We should determine whether it should be updated to use those settings or similar ones.
Reporter: Ryan Blue / @rdblue
Related issues:
- Min/Max record counts for block size checks are not configurable (duplicates)
- Large rows cause unnecessary OOM exceptions (is related to)
Note: This issue was originally created as PARQUET-409. Please see the migration documentation for further details.
Ryan Blue / @rdblue: @danielcweeks, do you have thoughts on this?
Daniel C. Weeks / @danielcweeks: I definitely think it's worth exposing as a configurable property. However, I haven't seen an issue where these checks are producing bad row group sizes.
I have seen some outlier datasets that have rows in excess of 10MB, but only a few records per file. With that kind of size, you could get disproportionately sized row groups given enough records.
Sasha Ovsankin: In our use case we have large data items (images), and the hardcoded minimum of 100 rows is too large. We definitely need this parameter.
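To make the large-row concern concrete, here is a minimal Java sketch of a row-count-based size check with configurable bounds. It is only loosely modeled on the kind of check discussed in this issue and is not the actual InternalParquetRecordWriter code. The class name SizeCheckSketch, its methods, the 128 MB row group target, and the maximum of 10,000 rows are all assumptions made for illustration.

```java
// Illustrative sketch only; not the parquet-mr implementation.
// All names and constants here are hypothetical.
public class SizeCheckSketch {
    private final long minRowCountForCheck; // hypothetical configurable minimum
    private final long maxRowCountForCheck; // hypothetical configurable maximum

    public SizeCheckSketch(long minRowCountForCheck, long maxRowCountForCheck) {
        this.minRowCountForCheck = minRowCountForCheck;
        this.maxRowCountForCheck = maxRowCountForCheck;
    }

    /**
     * Picks the record count at which to check the buffered size again,
     * assuming the last check did not trigger a flush. The next check is
     * placed halfway to the estimated "row group full" point, clamped by
     * the configured minimum and maximum. Requires recordCount > 0.
     */
    public long nextCheckAfterNoFlush(long recordCount, long bufferedBytes, long rowGroupBytes) {
        long avgRecordBytes = Math.max(1L, bufferedBytes / recordCount);
        long estimatedRecordsToFill = rowGroupBytes / avgRecordBytes;
        long halfway = (recordCount + estimatedRecordsToFill) / 2;
        return Math.min(Math.max(minRowCountForCheck, halfway),
                        recordCount + maxRowCountForCheck);
    }

    /** Bytes buffered before the first check when every row has the same size. */
    public static long bytesBufferedAtFirstCheck(long minRowCount, long rowBytes) {
        return minRowCount * rowBytes;
    }

    public static void main(String[] args) {
        long rowBytes = 10L * 1024 * 1024;       // 10 MB rows, as in the outlier datasets above
        long rowGroupBytes = 128L * 1024 * 1024; // assumed 128 MB row group target

        // With a fixed minimum of 100 rows, 1000 MB is buffered before the first check.
        System.out.println(bytesBufferedAtFirstCheck(100, rowBytes) / (1024 * 1024)); // prints 1000

        // With a configurable minimum of 1, the first check happens after one row,
        // and the next check is scheduled at record 6.
        SizeCheckSketch configurable = new SizeCheckSketch(1, 10_000);
        System.out.println(configurable.nextCheckAfterNoFlush(1, rowBytes, rowGroupBytes)); // prints 6
    }
}
```

Under these assumptions, a fixed minimum of 100 rows lets the writer buffer roughly eight times the row group target before it looks at memory for the first time, which is the behavior a configurable minimum would avoid.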
Robbie Gruener / @rgruener: I have created a PR for this task: https://github.com/apache/parquet-mr/pull/495