[core/iceberg] Added optional snapshot summary fields to iceberg metadata

Open 0dunay0 opened this issue 2 months ago • 1 comments

Purpose

Linked issue: close #6502

This PR fixes Redshift Spectrum querying for Paimon tables with Iceberg compatibility by populating optional snapshot summary fields that are required by certain Iceberg query engines.

When Paimon generates Iceberg metadata, it currently writes only the operation field in snapshot summaries. Although the Iceberg specification marks most summary fields as "optional," some query engines (notably AWS Redshift Spectrum) need fields such as total-records to parse and query tables.

As a result, Paimon+Iceberg tables can be queried in AWS Athena but fail in Redshift Spectrum with the error: Required field total-records missing.

Changes

Added computeSnapshotSummary() Helper Method

Aggregates statistics from IcebergManifestFileMeta objects to compute snapshot-level metrics including:

Fields always written (optional in the Iceberg specification, but expected by some engines):

  • total-records - Total number of live records
  • total-data-files - Total number of live data files
  • total-delete-files - Total number of live delete files
  • total-position-deletes - Total position delete records
  • total-equality-deletes - Always "0" (Paimon doesn't use equality deletes)

Optional fields (when non-zero):

  • added-data-files, added-records, added-files-size
  • deleted-data-files, deleted-records, deleted-files-size
  • total-files-size
  • changed-partition-count

Tests

Updated IcebergMetadataTest.java
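For illustration only, a check along the following lines could guard against the always-written fields going missing again. It is a hypothetical helper, not the assertions actually added to IcebergMetadataTest.java, and the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SummaryFieldCheck {

    // The five fields this PR describes as always present.
    static final List<String> ALWAYS_PRESENT = List.of(
            "total-records",
            "total-data-files",
            "total-delete-files",
            "total-position-deletes",
            "total-equality-deletes");

    /** Returns the always-present fields that are absent from the given summary. */
    static List<String> missingFields(Map<String, String> summary) {
        List<String> missing = new ArrayList<>();
        for (String key : ALWAYS_PRESENT) {
            if (!summary.containsKey(key)) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // A summary with only "operation", as written before this change.
        System.out.println(missingFields(Map.of("operation", "append")));
    }
}
```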

API and Format

N/A

Documentation

Reintroduces a feature that was previously available.

aws s3 cp s3://some-bucket/paimon/warehouse/somedb.db/some_table/metadata/v190.metadata.json - | jq '.snapshots[0].summary'

{
  "added-data-files": "2",
  "total-equality-deletes": "0",
  "added-records": "83282",
  "deleted-data-files": "0",
  "deleted-records": "0",
  "total-records": "83282",
  "deleted-files-size": "0",
  "changed-partition-count": "1",
  "total-position-deletes": "0",
  "added-files-size": "4683766",
  "total-delete-files": "0",
  "total-files-size": "4683766",
  "total-data-files": "2",
  "operation": "append"
}

Redshift Spectrum can now query the table.

0dunay0, Oct 31 '25 16:10

@JingsongLi Can you take a look at your convenience please?

0dunay0, Nov 03 '25 11:11