InternalParquetRecordWriter doesn't use min/max row counts

Open asfimport opened this issue 10 years ago • 4 comments

PARQUET-99 added settings to control the minimum and maximum number of rows between size checks when flushing pages, along with a setting to always use a static row count (the minimum) instead of an estimate. The InternalParquetRecordWriter has similar checks that don't use those settings. We should determine whether it should be updated to use those settings or similar ones.
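
For illustration, here is a minimal sketch of what a row-count-bounded size check could look like. It is not the parquet-java source: the class name, method name, and parameters are hypothetical, and the "check halfway to the estimated limit" heuristic is an assumption made for the example. The point is only that the gap between two size checks is clamped to a configurable minimum and maximum.

```java
// Hypothetical sketch, not the parquet-java implementation.
final class RowCountCheckSketch {

    /**
     * Returns the row count at which the buffered size should be checked next.
     * The gap to the next check is half the estimated remaining rows, clamped
     * to [minRows, maxRows]. All parameter names are assumptions.
     */
    static long nextCheckRowCount(long rowsSoFar, long bufferedBytes, long targetBytes,
                                  long minRows, long maxRows) {
        if (rowsSoFar <= 0 || bufferedBytes <= 0) {
            return minRows; // nothing measured yet, wait for the minimum
        }
        long avgRowBytes = Math.max(1, bufferedBytes / rowsSoFar);
        long estimatedRowsAtTarget = targetBytes / avgRowBytes;
        long gap = (estimatedRowsAtTarget - rowsSoFar) / 2;
        long clampedGap = Math.min(Math.max(gap, minRows), maxRows);
        return rowsSoFar + clampedGap;
    }

    public static void main(String[] args) {
        long target = 128L * 1024 * 1024; // assumed 128 MB row group target
        // 100 rows of about 1 KB each: the estimate is far away, so the max applies.
        System.out.println(nextCheckRowCount(100, 100L * 1024, target, 100, 10_000));
    }
}
```

With small rows the maximum keeps the writer from going too long without a check; with very large rows the minimum is what decides how often the size is checked, which is why it matters that it be configurable.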

Reporter: Ryan Blue / @rdblue

Note: This issue was originally created as PARQUET-409. Please see the migration documentation for further details.

asfimport avatar Dec 16 '15 18:12 asfimport

Ryan Blue / @rdblue: @danielcweeks, do you have thoughts on this?

asfimport avatar Dec 16 '15 18:12 asfimport

Daniel C. Weeks / @danielcweeks: I definitely think it's worth exposing as a configurable property. However, I haven't seen an issue where these checks are producing bad row group sizes.

I have seen some outlier datasets that have rows in excess of 10MB, but only a few records per file. With that kind of size, you could get disproportionately sized row groups given enough records.

asfimport avatar Dec 16 '15 20:12 asfimport

Sasha Ovsankin: In our use case we have large data items (images), and the hardcoded minimum of 100 rows is too large. We definitely need this parameter.
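
To make the concern concrete, a rough back-of-the-envelope sketch follows. The 10 MB row size is borrowed from the outlier datasets mentioned earlier in this thread, the 128 MB row group target is an assumed setting, and the class and method names are hypothetical. It only multiplies the minimum row count by the average row size to show how much data can be buffered before the first size check happens.

```java
// Hypothetical arithmetic sketch; names and sizes are assumptions for illustration.
final class FirstCheckOvershootSketch {

    /** Bytes buffered by the time the first size check runs. */
    static long bytesBufferedAtFirstCheck(long minRowsBeforeCheck, long avgRowBytes) {
        return Math.multiplyExact(minRowsBeforeCheck, avgRowBytes);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long targetBytes = 128 * mb; // assumed row group target
        long rowBytes = 10 * mb;     // assumed average row size
        long buffered = bytesBufferedAtFirstCheck(100, rowBytes);
        // prints "1000 MB buffered vs 128 MB target"
        System.out.println(buffered / mb + " MB buffered vs " + targetBytes / mb + " MB target");
    }
}
```

Under these assumptions the writer would hold roughly eight times the target size in memory before it first looks at the buffered size, whereas a minimum of around 10 rows would bring the first check in near the target.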

asfimport avatar Jan 10 '18 18:01 asfimport

Robbie Gruener / @rgruener: I have created a PR for this task: https://github.com/apache/parquet-mr/pull/495

asfimport avatar Jun 14 '18 14:06 asfimport