
[C++][Parquet] Denominate row group size in bytes (not in no of rows)

Open asfimport opened this issue 6 years ago • 4 comments

Both the C++ implementation of the Parquet writer for Arrow and the Python bindings on top of it denominate the row group size in the number of rows (without making this very explicit; a pyarrow sketch illustrating this follows the list below). Whereas:

(1) The Apache parquet documentation states: 

"Row group size: Larger row groups allow for larger column chunks which makes it possible to do larger sequential IO. Larger groups also require more buffering in the write path (or a two pass write). We recommend large row groups (512MB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block. Therefore, HDFS block sizes should also be set to be larger. An optimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block per HDFS file."

(2) The reference Apache parquet-mr implementation for Java accepts the row group size expressed in bytes.

(3) The low-level Parquet read-write example also denominates the row group size in bytes.
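For reference, here is a minimal sketch of the current pyarrow behaviour mentioned above (the file name and table contents are invented for illustration): the `row_group_size` parameter of `pyarrow.parquet.write_table` is a maximum row count per row group, not a byte size.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A toy table; the column names and values are invented for illustration.
table = pa.table({"id": list(range(1_000_000)),
                  "value": [float(i) for i in range(1_000_000)]})

# row_group_size is a *row count*: each row group holds at most 128 * 1024
# rows here, regardless of how many bytes those rows occupy on disk.
pq.write_table(table, "example.parquet", row_group_size=128 * 1024)
```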

These insights make me conclude that:

  • Per the Parquet design, and to take advantage of HDFS block-level operations, it only makes sense to express row group sizes in bytes, as that is the only consequential quantity the caller can meaningfully want to influence.

  • The Arrow implementation of ParquetWriter would benefit from re-denominating its row_group_size into bytes. I will also note that it is impossible to use pyarrow to shape equally byte-sized row groups, since the size a row group ends up taking is post-compression and the caller only knows how much uncompressed data they have managed to put in (a rough workaround is sketched after this list).

    Now, my conclusions may be wrong and I may be blind to some line of reasoning, so this ticket is more of a question than a bug report: a question of whether the audience here agrees with my reasoning and, if not, a request to explain what detail I have missed.
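As a rough illustration of the limitation described above, the closest one can get today is to estimate the uncompressed in-memory size of the table, slice it accordingly, and emit each slice as its own row group; the post-compression size on disk will still differ. This is a hedged sketch only (the helper name and the 512 MiB target are invented), not an API the Arrow project provides:

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_with_approx_row_group_bytes(table: pa.Table, path: str,
                                      target_bytes: int = 512 * 1024 * 1024) -> None:
    """Slice `table` into row groups whose *uncompressed* in-memory size is
    roughly `target_bytes`; the compressed on-disk size will differ."""
    # Estimate bytes per row from the table's in-memory buffers.
    bytes_per_row = max(1, table.nbytes // max(1, table.num_rows))
    rows_per_group = max(1, target_bytes // bytes_per_row)

    with pq.ParquetWriter(path, table.schema) as writer:
        for start in range(0, table.num_rows, rows_per_group):
            # Each write_table call produces at least one new row group.
            writer.write_table(table.slice(start, rows_per_group))
```

Even this only budgets the uncompressed estimate; the actual byte size of each row group depends on encoding and compression, which is exactly why a writer-side, bytes-denominated option would help.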


Reporter: Remek Zajac

Note: This issue was originally created as ARROW-4542. Please see the migration documentation for further details.

asfimport · Feb 12 '19 14:02

Wes McKinney / @wesm: Having the option to size based on the number of bytes is reasonable; however, it is not necessarily trivial to implement. If you would like to help with this, please be our guest.

asfimport · Feb 12 '19 15:02

Remek Zajac: Thanks for the prompt feedback @wesm - glad you agree with my reasoning. Sure - I know it ain't trivial; I saw what parquet-mr does. I will wait for the ticket to cool off a bit and/or collect more feedback. And thanks for inviting my contribution. We may find time to help here, depending on how much it ends up hurting us compared to other places that hurt too.

asfimport · Feb 12 '19 16:02

Antoine Pitrou / @pitrou: [[email protected]] Are you still willing to work on this?

asfimport · Jan 30 '20 17:01

This issue has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label or comment below, or this issue will be closed in 14 days. If this improvement is still desired but has no current owner, please add the 'Status: needs champion' label.

github-actions[bot] · Dec 13 '25 11:12