OPTIMIZE does not clean up equality delete files when singleton data files are skipped (regression after PR #23864)

Open nrutherford-w opened this issue 4 months ago • 1 comments

Issue

After upgrading to a version of Trino that includes PR #23864, we observed that OPTIMIZE no longer cleans up equality delete files in certain cases, even after new data and new deletes are added to partitions. As a result, equality delete files accumulate, which impacts performance and storage. Reverting the changes made in PR #23864 restores the expected behavior of the trino-iceberg plugin.

Symptoms

• Equality delete files remain in the table after OPTIMIZE operations.
• This persists even after new data is added and more deletes are performed in the affected partitions.
• The Trino logs show that many files are being skipped during OPTIMIZE, for example:
  INFO IcebergSplitSource Generated 2 splits, skipped 6 files for OPTIMIZE
• Only some equality delete files are removed, leaving many behind:
  INFO CommitReport ... removedEqualityDeletes=93, totalEqualityDeletes=124

Steps to reproduce

Consistent manual reproduction has yet to be achieved. This behavior presents itself during day-to-day operations.

Expected behavior

• OPTIMIZE should rewrite all files that are referenced by equality delete files, removing reliance on equality deletes from prior snapshots.
• After all affected data files have been rewritten, equality delete files should be safely removed/cleaned up.

Actual behavior

• OPTIMIZE is not handling all equality deletes when executed, leaving behind delete files that are never cleaned up.
• Data files that are “clean” (only file in partition) are being skipped by OPTIMIZE, even if they are still referenced by equality delete files.
• This results in equality delete files that are effectively “stuck” and never cleaned up, unless forced by a full-table rewrite. Equality deletes have reached over 2 billion.

Logs

INFO IcebergSplitSource Generated 2 splits, skipped 6 files for OPTIMIZE
INFO CommitReport ... removedEqualityDeleteFiles=93, totalEqualityDeletes=124

Environment

• Trino version: 464+
• Iceberg version: 1.6.1
• Catalog: REST
• Table format: Iceberg V2

Additional context

This behavior appears to be a regression or unintended consequence of the logic introduced in PR #23864, which skips rewriting singleton files in partitions unless they have direct deletes.
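
To make the suspected failure mode concrete, here is a minimal sketch of the skip rule as described above. It is not the actual Trino code: the class and function names are hypothetical, "direct deletes" is assumed to mean deletes attached to one specific data file, and an equality delete is assumed to apply to data files in the same partition with a strictly lower sequence number (the Iceberg V2 rule for partitioned deletes).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str
    sequence_number: int
    # Assumption: "direct" deletes are deletes attached to this specific file.
    has_direct_deletes: bool = False


@dataclass(frozen=True)
class EqualityDelete:
    path: str
    partition: str
    sequence_number: int


def skipped_by_singleton_rule(data_files):
    """Files skipped under the rule as this issue describes it (hypothetical model)."""
    per_partition = Counter(f.partition for f in data_files)
    return [
        f for f in data_files
        if per_partition[f.partition] == 1 and not f.has_direct_deletes
    ]


def stuck_equality_deletes(data_files, equality_deletes):
    """Equality deletes that still apply to at least one skipped data file."""
    skipped = skipped_by_singleton_rule(data_files)
    return [
        d for d in equality_deletes
        if any(
            f.partition == d.partition and f.sequence_number < d.sequence_number
            for f in skipped
        )
    ]


files = [
    DataFile("a.parquet", "day=1", 1),  # only file in its partition
    DataFile("b.parquet", "day=2", 1),
    DataFile("c.parquet", "day=2", 2),
]
deletes = [
    EqualityDelete("eq-1.parquet", "day=1", 3),
    EqualityDelete("eq-2.parquet", "day=2", 3),
]

stuck = stuck_equality_deletes(files, deletes)
print([d.path for d in stuck])  # prints ['eq-1.parquet']
```

In this model the singleton file in `day=1` is never rewritten, so `eq-1.parquet` keeps applying to it and cannot be dropped, which matches the "stuck" delete files we see.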

Proposed solution

• OPTIMIZE should ensure that all equality delete files are released by rewriting the data files they reference.
• The file selection logic should be updated to ensure that all files referenced by equality delete files are included in the rewrite.
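
A rough sketch of the selection rule proposed here, under stated assumptions: all names are hypothetical and this is not the trino-iceberg implementation. An equality delete is taken to apply to data files in the same partition with a strictly lower sequence number, and unpartitioned (global) deletes are ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str
    sequence_number: int


@dataclass(frozen=True)
class EqualityDelete:
    path: str
    partition: str
    sequence_number: int


def applies_to(delete, data_file):
    """Assumed rule: same partition and a strictly newer delete."""
    return (
        delete.partition == data_file.partition
        and delete.sequence_number > data_file.sequence_number
    )


def files_to_rewrite(data_files, equality_deletes):
    """Select every data file that some equality delete still applies to,
    regardless of how many files its partition contains."""
    return [
        f for f in data_files
        if any(applies_to(d, f) for d in equality_deletes)
    ]


def releasable_deletes(data_files, equality_deletes, rewritten):
    """Equality deletes that no longer apply to any file left untouched."""
    remaining = [f for f in data_files if f not in rewritten]
    return [
        d for d in equality_deletes
        if not any(applies_to(d, f) for f in remaining)
    ]


files = [
    DataFile("a.parquet", "day=1", 1),  # singleton, older than eq-1
    DataFile("b.parquet", "day=2", 5),  # singleton, newer than eq-2
]
deletes = [
    EqualityDelete("eq-1.parquet", "day=1", 3),
    EqualityDelete("eq-2.parquet", "day=2", 3),
]

selected = files_to_rewrite(files, deletes)
released = releasable_deletes(files, deletes, selected)
print([f.path for f in selected])  # prints ['a.parquet']
print([d.path for d in released])  # prints ['eq-1.parquet', 'eq-2.parquet']
```

The point of the sketch is that selection is driven by whether a delete still applies to a file, not by the file count in the partition, so a singleton file such as `a.parquet` is rewritten while `b.parquet` can still be skipped safely.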

Workarounds attempted

• Adding new data and deletes to affected partitions (did not resolve the issue).
• Running OPTIMIZE multiple times (did not resolve the issue).
• Increasing the file_size_threshold so that the data in each partition is not split across multiple files because of its size (did not resolve the issue).
• Forcing a full rewrite by creating a new table and copying the data over (does resolve the issue). However, this is not practical as a regular workaround.

nrutherford-w, Jun 06 '25 16:06