
Add ParquetFileMerger for efficient row-group level file merging

shangxinli opened this issue 1 month ago • 22 comments

Why this change?

This implementation provides significant performance improvements for Parquet file merging operations by eliminating serialization/deserialization overhead. Benchmark results show 10x faster file merging compared to traditional read-rewrite approaches.

The change leverages existing Parquet library capabilities (ParquetFileWriter appendFile API) to perform zero-copy row-group merging, making it ideal for compaction and maintenance operations on large Iceberg tables.
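For context, below is a minimal sketch of the underlying Parquet API pattern that appendFile-based merging relies on. It is illustrative only, not the PR's ParquetFileMerger code: the class name is made up, and the writer constructor shown is one of several overloads (the PR itself uses a newer, non-deprecated one).

```java
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;

public class AppendFileMergeSketch {

  // Merge the given Parquet files into a single output file by copying row groups
  // without decoding them. All inputs are expected to share an identical schema.
  public static void merge(List<Path> inputs, Path output, Configuration conf) throws Exception {
    MessageType schema = null;
    for (Path input : inputs) {
      try (ParquetFileReader reader =
          ParquetFileReader.open(HadoopInputFile.fromPath(input, conf))) {
        MessageType fileSchema = reader.getFooter().getFileMetaData().getSchema();
        if (schema == null) {
          schema = fileSchema;
        } else if (!schema.equals(fileSchema)) {
          throw new IllegalArgumentException("Schema mismatch for " + input);
        }
      }
    }

    ParquetFileWriter writer =
        new ParquetFileWriter(
            HadoopOutputFile.fromPath(output, conf),
            schema,
            ParquetFileWriter.Mode.CREATE,
            ParquetWriter.DEFAULT_BLOCK_SIZE,
            ParquetWriter.MAX_PADDING_SIZE_DEFAULT);
    writer.start();
    for (Path input : inputs) {
      // appendFile copies the encoded row groups of the input file as-is.
      writer.appendFile(HadoopInputFile.fromPath(input, conf));
    }
    writer.end(Collections.emptyMap());
  }
}
```

Because appendFile copies already-encoded row groups, no records are decoded or re-encoded, which is where the performance gain comes from.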

TODO: 1) Encrypted tables are not supported yet. 2) Schema evolution is not handled yet.

What changed?

  • Added ParquetFileMerger class for row-group level file merging
    • Performs zero-copy merging using ParquetFileWriter.appendFile()
    • Validates schema compatibility across all input files
    • Supports merging multiple Parquet files into a single output file
  • Reuses existing Apache Parquet library functionality instead of a custom implementation
  • Strict schema validation ensures data integrity during merge operations
  • Added comprehensive error handling for schema mismatches

Testing

  • Validated in staging test environment
  • Verified schema compatibility checks work correctly
  • Confirmed 13x performance improvement over traditional approach
  • Tested with various file sizes and row group configurations

shangxinli (Oct 28 '25 17:10)

Thanks @shangxinli for the PR! At a high level, leveraging Parquet’s appendFile for row‑group merging is the right approach and a performance win. Making it opt‑in via an action option and a table property is appropriate.

A couple of areas I’d like to discuss:

  • IO integration: Would it make sense to route IO through table.io()/OutputFileFactory rather than Hadoop IO?
  • Executor/driver split: Should executors only write files and return locations/sizes, with DataFiles (and metrics) constructed on the driver?

I’d also like to get others’ opinions. @pvary @amogh-jahagirdar @nastra @singhpk234

huaxingao (Nov 03 '25 06:11)

I have a few concerns here:

  • I would prefer the decision to do row-group level merging to be made at the action level, and not leaked into the table properties
  • I would prefer to check the requirements as early as possible and fail, or fall back to the normal rewrite with logging, if the requirements are not met
  • During planning we can create groups with the expected sizes; in that case the runner could rewrite whole groups and would not need to split the planned groups to the expected file sizes
  • Always using HadoopFileIO could be problematic. The catalog might define a different FileIO implementation. We should handle that case correctly and use the Catalog/Table-provided FileIO
  • We don't reuse the ParquetFileMerger object. In this case, I usually prefer static methods.

pvary (Nov 03 '25 15:11)

Thanks @huaxingao for the review and feedback! I've addressed both of your points.

  1. IO integration: Good catch! I've updated the implementation to use table.io() instead of hardcoding HadoopFileIO. The new approach:
  • Executors still use Hadoop Configuration for the actual Parquet file merging (since ParquetFileMerger internally uses Parquet's appendFile which requires Hadoop APIs)
  • Driver now uses table.io().newInputFile() to read metrics, which properly respects the catalog's configured FileIO implementation
  • This ensures compatibility with different storage systems (S3, GCS, Azure, custom FileIO implementations)
  2. Executor/driver split: I've refactored to follow the recommended pattern:
  • Executors: Only perform the file merge operation and return lightweight metadata (file path, size) via a MergeResult object
  • Driver: Receives the metadata, reads metrics using table.io(), and constructs the full DataFile objects (see the sketch after this list)
  • This minimizes serialization overhead and keeps heavyweight objects on the driver side
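As a rough illustration of that driver-side step, the sketch below uses Iceberg's public APIs. The method name, the (path, size) inputs, and the unpartitioned-table assumption are mine, not the PR's exact code.

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Metrics;
import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.parquet.ParquetUtil;

class DriverSideDataFileSketch {

  // Build a DataFile on the driver from the lightweight result returned by an executor.
  static DataFile toDataFile(Table table, String mergedPath, long mergedSizeInBytes) {
    // table.io() respects the catalog-configured FileIO (S3FileIO, GCSFileIO, HadoopFileIO, ...).
    InputFile inputFile = table.io().newInputFile(mergedPath);

    // Read column-level metrics from the merged file's Parquet footer.
    Metrics metrics = ParquetUtil.fileMetrics(inputFile, MetricsConfig.forTable(table));

    // Assumes an unpartitioned spec; a partitioned table would also need withPartition(...).
    return DataFiles.builder(table.spec())
        .withPath(mergedPath)
        .withFormat(FileFormat.PARQUET)
        .withFileSizeInBytes(mergedSizeInBytes)
        .withMetrics(metrics)
        .build();
  }
}
```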

Additionally, I've addressed the architectural feedback about planning vs. execution:

  • Removed the groupFilesBySize() logic from the runner - the planner already creates appropriately-sized groups
  • Runner now merges the entire file group into a single output file without further splitting
  • This creates a cleaner separation where planning happens in the planner and execution happens in the runner

Let me know if there are any other concerns or improvements you'd like to see!

shangxinli (Nov 04 '25 15:11)

Thanks @pvary for the detailed feedback! I've addressed your points:

  1. Decision at action level, not table properties: Done - Removed the PARQUET_USE_FILE_MERGER from TableProperties entirely and stripped out the fallback logic. Now it only checks the action options (see the usage sketch after this list).

  2. Early validation with proper fallback: Done - Flipped the logic around to validate upfront with canUseMerger() before attempting the merge. Also beefed up the validation to actually check schema compatibility, not just file format. If anything fails, it logs and falls back to the standard rewrite.

  3. Planning creates expected sizes, runner doesn't split: Done - Nuked the whole groupFilesBySize() method. The runner now just merges whatever the planner gave it into a single file - no more re-grouping.

  4. Use Catalog/Table FileIO instead of HadoopFileIO: Done - Removed HadoopFileIO completely. Executors now just return path + size, and the driver reads metrics using table().io() which respects whatever FileIO the catalog configured.

  5. Static methods instead of object creation: Done - Converted ParquetFileMerger to a utility class with private constructor and all static methods. No more new ParquetFileMerger() calls anywhere.
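For reference, enabling the behavior per action could look like the following. The option key used here is hypothetical; the real key is defined in the PR and is not quoted in this thread.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.spark.actions.SparkActions;

class EnableRowGroupMergeSketch {

  static RewriteDataFiles.Result rewrite(Table table) {
    return SparkActions.get()
        .rewriteDataFiles(table)
        // Hypothetical option key: opt in to row-group level merging for this action only.
        .option("use-parquet-row-group-merge", "true")
        .execute();
  }
}
```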

shangxinli (Nov 04 '25 15:11)

Thanks for the PR! I have a question about lineage: if the merging is only performed at the Parquet layer, will the lineage information of a v3 table be disrupted?

Guosmilesmile (Nov 12 '25 13:11)

Thanks for the PR! I have a question about lineage: if the merging is only performed at the Parquet layer, will the lineage information of a v3 table be disrupted?

Good question! The lineage information for v3 tables is preserved in two ways:

  1. Field IDs (Schema Lineage)

Field IDs are preserved because we strictly enforce identical schemas across all files being merged.

In ParquetFileMerger.java:130-136, we validate that all input files have exactly the same Parquet MessageType schema:

```java
if (!schema.equals(currentSchema)) {
  throw new IllegalArgumentException(
      String.format(
          "Schema mismatch detected: file '%s' has schema %s but file '%s' has schema %s. "
              + "All files must have identical Parquet schemas for row-group level merging.",
          ...));
}
```

Field IDs are stored directly in the Parquet schema structure itself (via Type.getId()), so when we copy row groups using ParquetFileWriter.appendFile() with the validated schema, all field IDs are preserved.
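As a small standalone illustration (not code from the PR), field IDs are attached to the Parquet types themselves via the schema builder, so any schema-identical copy of the row groups carries them along:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

class FieldIdSketch {
  public static void main(String[] args) {
    // Iceberg writes its field IDs into the Parquet schema; they travel with the types.
    MessageType schema =
        Types.buildMessage()
            .required(PrimitiveTypeName.INT64).id(1).named("id")
            .optional(PrimitiveTypeName.BINARY).as(LogicalTypeAnnotation.stringType()).id(2).named("data")
            .named("table");

    // Prints the id annotation of the "id" field, i.e. 1.
    System.out.println(schema.getType("id").getId());
  }
}
```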

  2. Row IDs (Row Lineage for v3+)

Row IDs are automatically assigned by Iceberg's commit framework - we don't need special handling in the merger.

Here's how it works:

  1. Our code creates DataFile objects with metrics (including recordCount) but without firstRowId - see SparkParquetFileMergeRunner.java:236-243
  2. During commit, SnapshotProducer creates a ManifestListWriter initialized with base.nextRowId() (the table's current row ID counter) - see SnapshotProducer.java:273
  3. ManifestListWriter.prepare() automatically assigns firstRowId to each manifest and increments the counter by the number of rows - see ManifestListWriter.java:136-140:

     // assign first-row-id and update the next to assign
     wrapper.wrap(manifest, nextRowId);
     this.nextRowId += manifest.existingRowsCount() + manifest.addedRowsCount();
  4. The snapshot is committed with the updated nextRowId, ensuring all row IDs are correctly tracked

This is the same mechanism used by all Iceberg write operations, so row lineage is fully preserved for v3 tables.

shangxinli (Nov 16 '25 01:11)

Summary of Changes Made to Address PR Comments in Round 3:

  1. Removed "zero-copy" terminology
  • Files: RewriteDataFiles.java, ParquetFileMerger.java
  • Change: Replaced "zero-copy" with more accurate "directly copying row groups without deserialization and re-serialization"
  2. Added missing restrictions to javadoc
  • Files: RewriteDataFiles.java, ParquetFileMerger.java
  • Added:
    • Files must not have associated delete files or delete vectors
    • Table must not have a sort order (including z-ordered tables)
    • All files must have compatible schemas (in ParquetFileMerger)
  3. Replaced deprecated ParquetFileWriter constructor
  • File: ParquetFileMerger.java
  • Change: Switched from deprecated 5-parameter constructor to non-deprecated 8-parameter constructor with proper defaults
  4. Made configuration values user-configurable (see the configuration sketch after this list)
  • File: SparkParquetFileMergeRunner.java, ParquetFileMerger.java
  • Added:
    • rowGroupSize: Read from table property write.parquet.row-group-size-bytes (default 128MB)
    • columnIndexTruncateLength: Read from Hadoop config parquet.columnindex.truncate.length (default 64)
  5. Implemented encryption check
  • File: SparkParquetFileMergeRunner.java
  • Change: Added ParquetCryptoRuntimeException catch to detect encrypted files
  6. Removed custom grouping logic (planner concern)
  • File: SparkParquetFileMergeRunner.java
  • Change: Replaced groupFilesBySize() with distributeFilesEvenly() to trust planner's expectedOutputFiles calculation
  7. Updated documentation
  • File: docs/docs/maintenance.md
  • Added: New section documenting Parquet row-group level merging feature with requirements and usage
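A short sketch of how those two settings could be resolved. The property keys are the real Iceberg and Parquet names; the wrapper methods and the 128 MB fallback literal are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.util.PropertyUtil;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetOutputFormat;

class MergeConfigSketch {

  static long rowGroupSize(Table table) {
    // write.parquet.row-group-size-bytes, falling back to 128 MB.
    return PropertyUtil.propertyAsLong(
        table.properties(), TableProperties.PARQUET_ROW_GROUP_SIZE_BYTES, 128L * 1024 * 1024);
  }

  static int columnIndexTruncateLength(Configuration conf) {
    // parquet.columnindex.truncate.length, falling back to Parquet's default (64).
    return conf.getInt(
        ParquetOutputFormat.COLUMN_INDEX_TRUNCATE_LENGTH,
        ParquetProperties.DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH);
  }
}
```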

shangxinli (Nov 16 '25 19:11)

(Quoting shangxinli's full reply above on field ID and row ID preservation.)

If we continue to rewrite data files across three snapshots, and none of the data files contain row IDs, then as I understand it the rewritten output will be a single data file whose data may not be arranged in commit order (for example, snapshot-1, snapshot-3, snapshot-2). If the original row IDs are not written into the data file, the lineage tracking will be incorrect. Additionally, if rewriteDataFiles is run concurrently on different groups without writing the row IDs, that will also cause problems.

This is my current concern about whether lineage can be preserved when merging only at the Parquet level.

Guosmilesmile (Nov 17 '25 05:11)

I agree with @Guosmilesmile that for V3 tables, row lineage tracking is broken because the original files don’t contain row_id; they inherit them later. Consider this scenario:

  • Commit 1 adds File 1 (50 rows), sets first_row_id = 0, but doesn’t assign row_ids. The row_ids are 0, 1, 2, 3..., 49.
  • Commit 2 adds File 2 (50 rows), sets first_row_id = 50, but doesn’t assign row_ids. The row_ids are 50, 51, 52, 53..., 99.
  • Commit 3 adds File 3 (50 rows), sets first_row_id = 100, but doesn’t assign row_ids. The row_ids are 100, 101, 102, 103..., 149.
  • Commit 4 performs compaction, merging File 1 and File 2. The commit sets first_row_id = 150, and since the new file doesn’t contain row_ids, they are assigned by the current algorithm starting from 150. The row_ids are 150, 151, 152, 153..., 249.

As a result, the compacted rows receive new row_ids, which breaks lineage tracking.

pvary (Nov 17 '25 08:11)

Great point and example! You're right! I think there could be two solutions:

Solution 1: Group files by row ID continuity

Group files with continuous _row_id ranges and preserve the IDs by setting the merged file's firstRowId to match.

Solution 2: Write _row_id as physical column

Read the virtual _row_id from each row and write it as a physical column.

Please share your thoughts.

shangxinli (Nov 18 '25 20:11)

Great point and example! You're right! I think there could be two solutions:

Solution 1: Group files by row ID continuity

Group files with continuous _row_id ranges and preserve the IDs by setting the merged file's firstRowId to match.

This is typically not an option, and very hard to find continuous stretches of row_ids.

Solution 2: Write _row_id as physical column

Read the virtual _row_id from each row and write it as a physical column.

Can we do this without rewriting the rowgroup? Do we still have gains compared to the "normal" read/write compaction?

pvary (Nov 19 '25 09:11)

This is typically not an option, and very hard to find continuous stretches of row_ids.

I agree there are many cases that cause discontinuity. However, for files written in consecutive commits, e.g. append-only datasets, the row IDs are continuous (e.g., File 1: rows 0-49, File 2: rows 50-99 in your example). When merging File 1 and File 2, the merged file could keep first_row_id=0 and span rows 0-99. This is a common use case: when first writing, we create many small files by increasing parallelism, and right after that we merge them into larger files.

Can we do this without rewriting the rowgroup? Do we still have gains compared to the "normal" read/write compaction?

Yes, we can, and there are still significant gains. Parquet's columnar format allows us to copy most column chunks directly while rewriting only the _row_id column. Hudi's row-group rewriter does exactly this with the _file_name column. They rewrite one column while copying all other columns, and still achieve ~10x performance improvement over full read/write compaction.

shangxinli (Nov 19 '25 18:11)

Option 1 is very situational. If Option 2 could have similar gains, then I'm all for it. What about cases where some rows have a value for row-id, and the other row-ids need to be calculated from default values?

pvary (Nov 19 '25 18:11)

Yeah, Option 1 is very situational. But it is quite a common case that data is freshly written and then quickly merged. Option 2 would be a much more complex code change, and adding a column would also add some storage overhead.

How about we implement Option 1 in this PR and then have a separate PR for Option 2 as an enhancement behind a feature flag?

shangxinli (Nov 20 '25 14:11)

Option 1 needs specific planning too. I would prefer Option 2.

pvary (Nov 20 '25 16:11)

Hmm... we can do Option 2 in the next PR. Otherwise, it would be a very large PR. Also, the added storage overhead might be a concern, since many companies emphasize cost efficiency now. Any specific concern about changing the planning?

shangxinli (Nov 20 '25 17:11)

I think it would be very complicated and only applicable for an edge case. Also, we should check if we can decrease the first_row_id. This might cause issues with concurrent commits

pvary (Nov 21 '25 06:11)

I think it would be very complicated and only applicable for an edge case. Also, we should check if we can decrease the first_row_id. This might cause issues with concurrent commits

Just pushed the changes for Option 2.

shangxinli (Nov 24 '25 05:11)

Made the changes to address the feedback.

  1. ParquetFileMerger.java
  • Renamed readAndValidateSchema() → canMerge() returning boolean instead of MessageType (see the validation sketch after this list)
  • Added row_id null validation directly into canMerge() method
  • Added validateRowIdColumnHasNoNulls() method to check physical _row_id columns have no null values
  • Changed exception handling from IllegalArgumentException | IOException to RuntimeException | IOException to catch ParquetCryptoRuntimeException
  • Updated Javadoc examples and documentation
  2. ParquetUtil.java
  • Added public constants COLUMN_INDEX_TRUNCATE_LENGTH and DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH (moved from ParquetWriter)
  3. ParquetWriter.java
  • Removed constants (moved to ParquetUtil)
  • Updated to reference ParquetUtil constants
  4. TestParquetFileMerger.java
  • Created comprehensive unit tests for ParquetFileMerger
  • Tests for canMerge() with various scenarios (empty lists, non-Parquet files, schema mismatches, etc.)
  5. SparkParquetFileMergeRunner.java
  • Renamed validateAndGetSchema() → canMerge() returning boolean
  • Removed ValidationResult dependency
  • Simplified to use canMerge() for validation instead of returning schema/metadata
  • Schema and metadata now read on executor from input files (reducing serialization)
  6. TestSparkParquetFileMergeRunner.java
  • Updated tests to use canMerge() instead of validateAndGetSchema()
  • Removed reflection-based testing code
  • Simplified test assertions to work with boolean return values
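For readers following along, here is a hedged sketch of what a canMerge()-style check could look like, based only on the description above. The names are assumptions, and the PR's actual check covers more (e.g. it also validates that physical _row_id columns have no nulls):

```java
import java.io.IOException;
import java.util.List;
import org.apache.parquet.crypto.ParquetCryptoRuntimeException;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.schema.MessageType;

class CanMergeSketch {

  // Returns true only if every input is readable, unencrypted Parquet and all schemas are identical.
  static boolean canMerge(List<InputFile> inputs) {
    if (inputs == null || inputs.isEmpty()) {
      return false;
    }
    MessageType schema = null;
    for (InputFile input : inputs) {
      try (ParquetFileReader reader = ParquetFileReader.open(input)) {
        MessageType fileSchema = reader.getFooter().getFileMetaData().getSchema();
        if (schema == null) {
          schema = fileSchema;
        } else if (!schema.equals(fileSchema)) {
          return false; // schema mismatch: fall back to the normal rewrite
        }
      } catch (ParquetCryptoRuntimeException e) {
        return false; // encrypted files are not supported yet
      } catch (RuntimeException | IOException e) {
        return false; // unreadable or non-Parquet file
      }
    }
    return true;
  }
}
```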

shangxinli (Nov 26 '25 06:11)

Summary of Changes

  1. Added _last_updated_sequence_number Support for Row Lineage (pvary's comment)
  • File: parquet/src/main/java/org/apache/iceberg/parquet/ParquetFileMerger.java
  • Extended row lineage to include both _row_id AND _last_updated_sequence_number columns
  • Renamed methods: addRowIdColumn() → addRowLineageColumns(), mergeFilesWithRowIdsAndSchema() → mergeFilesWithRowLineageAndSchema()
  • Added writeSequenceNumberColumnChunk() method to write the sequence number column
  • Updated mergeFiles() signature to accept dataSequenceNumbers parameter
  • The _last_updated_sequence_number is preserved from the original file's dataSequenceNumber (not a new value)
  • File: spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/SparkParquetFileMergeRunner.java
  • Extract dataSequenceNumbers from DataFile objects (line 239-240)
  • Updated MergeTaskInfo class to include dataSequenceNumbers field
  • Pass dataSequenceNumbers to ParquetFileMerger.mergeFiles()
  2. Refactored Code to Eliminate Duplication (user feedback)
  • File: parquet/src/main/java/org/apache/iceberg/parquet/ParquetFileMerger.java
  • Created utility helper writeLongColumnChunk() (lines 527-574) that accepts a LongUnaryOperator for value generation (see the value-generation sketch after this list)
  • Both writeRowIdColumnChunk() and writeSequenceNumberColumnChunk() now delegate to this helper
  • Eliminated ~80 lines of duplicated code
  3. Used Parquet's Built-in Constants (pvary's comment)
  • Files: ParquetUtil.java, ParquetWriter.java, SparkParquetFileMergeRunner.java
  • Removed custom constants: COLUMN_INDEX_TRUNCATE_LENGTH and DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH
  • Now using upstream constants:
    • ParquetOutputFormat.COLUMN_INDEX_TRUNCATE_LENGTH
    • ParquetProperties.DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH
  4. Added Comprehensive Row Lineage Tests (pvary's comment)
  • File: parquet/src/test/java/org/apache/iceberg/parquet/TestParquetFileMerger.java
  • Added 6 new tests:
    a. testMergeFilesSynthesizesRowLineageColumns() - Verifies synthesis of both columns
    b. testMergeFilesWithMultipleRowGroups() - Tests across multiple row groups
    c. testMergeFilesWithDifferentDataSequenceNumbers() - Verifies sequence number preservation
    d. testMergeFilesWithoutRowLineage() - Tests merging without row lineage
    e. testMergeFilesWithPhysicalRowLineageColumns() - Tests binary copy of existing columns
    f. testCanMergeReturnsFalseForPhysicalRowLineageWithNulls() - Validates null rejection
  • Updated existing tests to use files with data instead of empty files
  5. Changed Log Levels for Fallback Scenarios (user request)
  • File: spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/SparkParquetFileMergeRunner.java
  • Changed 3 log statements from WARN to INFO (lines 104, 168, 176)
  • These represent normal fallback scenarios, not errors
  6. Added Two New Validation Checks (pvary's comments)
  • File: spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/SparkParquetFileMergeRunner.java
  • Partition spec change check (lines 157-167): Falls back if partition spec is changing (row-group merge cannot repartition data)
  • File splitting check (lines 169-181): Falls back if any input file is larger than target output size (row-group merge is optimized for merging, not splitting)
  • Updated javadoc to document these new checks
  7. Updated OutputFileFactory Creation Pattern (pvary's comment)
  • File: spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/SparkParquetFileMergeRunner.java
  • Changed from: OutputFileFactory.builderFor(table(), spec.specId(), batchIndex)
  • Changed to: OutputFileFactory.builderFor(table(), batchIndex, 0).defaultSpec(spec)
  8. Kept Defensive Validation (pvary's earlier comment)
  • File: parquet/src/main/java/org/apache/iceberg/parquet/ParquetFileMerger.java
  • Maintained validateRowLineageColumnsHaveNoNulls() check in mergeFiles() even though it's also called in canMerge()
  • This is defensive programming since mergeFiles() is a public API that can be called directly
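To illustrate the shared value-generation idea behind writeLongColumnChunk(): a simplified sketch follows. The helper name and shape come from the summary above, but the actual PR writes these values as encoded Parquet column chunks, which is omitted here.

```java
import java.util.function.LongUnaryOperator;

class RowLineageValueSketch {

  // Generates the long values for one column chunk, given a per-position value function.
  static long[] generate(int rowCount, LongUnaryOperator valueForPosition) {
    long[] values = new long[rowCount];
    for (int pos = 0; pos < rowCount; pos++) {
      values[pos] = valueForPosition.applyAsLong(pos);
    }
    return values;
  }

  public static void main(String[] args) {
    long firstRowId = 100L;        // the original input file's first_row_id, so existing row IDs are preserved
    long dataSequenceNumber = 7L;  // preserved from the original file's metadata, not a new value

    // _row_id increases with row position; _last_updated_sequence_number is constant per file.
    long[] rowIds = generate(5, pos -> firstRowId + pos);
    long[] seqNumbers = generate(5, pos -> dataSequenceNumber);

    System.out.println(rowIds[0] + ".." + rowIds[4] + ", seq=" + seqNumbers[0]);
  }
}
```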

Files Modified

  1. parquet/src/main/java/org/apache/iceberg/parquet/ParquetFileMerger.java
  2. spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/actions/SparkParquetFileMergeRunner.java
  3. parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java
  4. parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java
  5. parquet/src/test/java/org/apache/iceberg/parquet/TestParquetFileMerger.java

Key Design Decisions

  • Row lineage is treated as a cohesive unit (both _row_id and _last_updated_sequence_number together)
  • Preserved dataSequenceNumber from original files (not creating new sequence numbers)
  • Used DELTA_BINARY_PACKED encoding for both row lineage columns (optimal compression)
  • Fallback to standard Spark rewrite is graceful and automatic when optimization cannot be used

shangxinli (Nov 26 '25 17:11)

| Data Files | Records per Data File | Normal Merge (s) | Parquet Merge (s) |
|------------|-----------------------|------------------|-------------------|
| 5          | 100*10000             | 4.7              | 0.9               |
| 5          | 10*10000              | 1.8              | 0.9               |
| 5          | 1*10000               | 0.8              | 0.9               |
| 40         | 10*10000              | 6.7              | 4.4               |
| 40         | 1*10000               | 4.1              | 3.6               |
| 40         | 100                   | 3.4              | 3.4               |
| 100        | 1000                  | 12.0             | 8.4               |

I ran some tests comparing Parquet merge and normal merge; the Parquet merge version I used is the original one without any lineage-related changes. From the current results, when there are many files with small contents, the performance advantage of Parquet merge is not particularly large; when files contain many rows, the advantage is significant. I suspect that validating the schema and reading the footer for every file introduces additional overhead.

I suggest that once the lineage part is ready, we add corresponding tests, because adding lineage will introduce more complexity.

These were manual tests; results may vary and are for reference only.

Guosmilesmile (Nov 27 '25 11:11)

(Quoting Guosmilesmile's benchmark table and observations above.)

Thanks for running the tests and sharing the observations — they generally match the expected behavior.

The performance characteristics also depend heavily on record size. When records are very small, the overhead of introducing physical row lineage becomes more noticeable relative to the actual data processing. Likewise, when files themselves are extremely small (e.g., ~100 records), the cost of validating the schema and reading the footer is comparable to reading the data itself, so the advantage of Parquet merge is naturally limited. To avoid reading the schema for every file, we could optimize this if DataFile metadata carried a schema ID.

In practice, for typical rewrite workloads, most of the time is spent on record-level compression/decompression and encoding/decoding rather than footer or schema handling. This is the dominant cost driver, and it is where Parquet merge shows more substantial benefits as row counts increase.

shangxinli (Nov 27 '25 22:11)

@shangxinli: Here is a bit of refactor to fix warnings, and such: https://github.com/pvary/iceberg/commit/884ebeabb5b08b1e52d08f219e305bfc46e9940e

Could you please take a look, apply whatever is OK with you, and then we can talk about the others?

pvary (Dec 15 '25 17:12)

I have seen this: #14853 - Maybe we can't use RLE? What is the normal Parquet writer using?

pvary (Dec 16 '25 11:12)

@shangxinli: Here is a bit of refactor to fix warnings, and such: pvary@884ebea

Could you please take a look, apply whatever is OK with you, and then we can talk about the others?

Refactored: 5339589

shangxinli (Dec 21 '25 00:12)

I have seen this: #14853 - Maybe we can't use RLE? What is the normal Parquet writer using?

We should use RLE because it's the standard encoding for definition/repetition levels in Apache Parquet.

  • In the Parquet documentation, BIT_PACKED is "deprecated and will be replaced by the RLE/bit-packing hybrid encoding"
  • In the Parquet code, def/rep levels are written with RunLengthBitPackingHybridValuesWriter, whose getEncoding() returns Encoding.RLE

Our code (ParquetFileMerger.java:537-539):

  • Line 537-538: Encoding.RLE for definition/repetition levels
  • Line 539: Encoding.DELTA_BINARY_PACKED for data values

PR #14853 is about reading RLE-encoded data pages, not def/rep levels. Since we use DELTA_BINARY_PACKED for data (line 539), that PR doesn't affect us.

shangxinli (Dec 21 '25 01:12)