[C++][Parquet] Ability to concat parquet files
The ability to concat parquet files is something we've wanted for some time too. When we generate parquet files partitioned by an expression, we often end up with many tiny files and would like to add a post-processing step that concats them into larger ones.
Is there a plan to add this ability to the library any time soon?
If not, it would be great if someone could provide somewhat detailed pseudocode (expanding on what @xhochy mentioned in the comment in PARQUET-1022) as a guideline to the conditions/scenarios that need to be handled with extra care, so we can contribute this as a PR.
Reporter: Nileema Shingte
Note: This issue was originally created as PARQUET-1626. Please see the migration documentation for further details.
Deepak Majeti / @majetideepak: The simplest approach is to read the two files and export them back as a single file. You can follow the existing reader-writer.cc example to do this.
If you want to optimize by avoiding the compression/decompression of the Data Pages, then you have to carefully update the metadata (counts, stats, offsets, lengths, etc.) at the File and ColumnChunk levels and append the individual RowGroups. If you further want to merge two RowGroups into one, you have to update the RowGroup metadata as well.
On top of this, you have to ensure the two files being merged are compatible. In your case, this won't be a problem since you have the same writer generating the parquet files with the same schema. But in general, if the two files are generated from different writers, you cannot easily merge them.
David Lee / @davlee1972: I'm appending RowGroups using pyarrow today.
Open a new parquet file
For each row group in File 1:
    read row group
    write row group to new file
For each row group in File 2:
    read row group
    write row group to new file
Close new parquet file
No need to mess with metadata since all those stats are saved at a row group level.
I usually generate parquet files of 30 to 40 MB each and merge them afterwards to match the HDFS block size.
This issue hasn't had activity in a long time. If it's still being worked on, please leave a comment. Otherwise, it will be closed on 23rd June.
Labelled Status: Stale-Warning for tracking.