
Cannot commit, found new delete for replaced data file

Open zhushanwei opened this issue 2 years ago • 4 comments

Query engine

iceberg-flink-runtime-1.14-0.14.0.jar. Help me, thanks.

Question

org.apache.iceberg.exceptions.ValidationException: Cannot commit, found new delete for replaced data file: GenericDataFile{content=data, file_path=hdfs://dev-001:8020/iceberg/flink_hive_iceberg/flink_hive_db.db/test_repository_1/data/news_postdate=2022-07-31/00002-0-8b3590ea-a593-4734-b84a-a6084a426b95-00093.parquet, file_format=PARQUET, spec_id=0, partition=PartitionData{news_postdate=2022-07-31}, record_count=106, file_size_in_bytes=110049, column_sizes=null, value_counts=null, null_value_counts=null, nan_value_counts=null, lower_bounds=null, upper_bounds=null, key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=null}
    at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:50)
    at org.apache.iceberg.MergingSnapshotProducer.validateNoNewDeletesForDataFiles(MergingSnapshotProducer.java:418)
    at org.apache.iceberg.MergingSnapshotProducer.validateNoNewDeletesForDataFiles(MergingSnapshotProducer.java:367)
    at org.apache.iceberg.BaseRewriteFiles.validate(BaseRewriteFiles.java:108)
    at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:175)
    at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:296)
    at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
    at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:214)
    at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:198)
    at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:190)
    at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:295)
    at org.apache.iceberg.actions.BaseSnapshotUpdateAction.commit(BaseSnapshotUpdateAction.java:41)
    at org.apache.iceberg.actions.BaseRewriteDataFilesAction.doReplace(BaseRewriteDataFilesAction.java:298)
    at org.apache.iceberg.actions.BaseRewriteDataFilesAction.replaceDataFiles(BaseRewriteDataFilesAction.java:277)
    at org.apache.iceberg.actions.BaseRewriteDataFilesAction.execute(BaseRewriteDataFilesAction.java:252)

zhushanwei avatar Aug 01 '22 10:08 zhushanwei

I am using upsert mode.

zhushanwei avatar Aug 01 '22 10:08 zhushanwei

The reason is that Iceberg found delete files associated with the data files you want to rewrite. Can you show me your case?
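If it helps to check that on your table, here is a quick sketch of my own (it assumes `table` is the Iceberg Table loaded via your TableLoader) that lists the delete files Iceberg associates with each data file:

// Sketch only: walk the current scan plan and print any delete files attached to each data file.
import java.io.IOException;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

try (CloseableIterable<FileScanTask> tasks = table.newScan().planFiles()) {
  for (FileScanTask task : tasks) {
    if (!task.deletes().isEmpty()) {
      System.out.println("data file: " + task.file().path());
      task.deletes().forEach(delete -> System.out.println("  delete file: " + delete.path()));
    }
  }
} catch (IOException e) {
  throw new RuntimeException(e);
}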

hzluting avatar Aug 02 '22 07:08 hzluting

The reason is that Iceberg found delete files associated with the data files you want to rewrite. Can you show me your case?

// writer data
Configuration conf = new Configuration();
Map<String, String> properties = new HashMap<>();
CatalogLoader catalogLoader = CatalogLoader.hive(Constants.CATALOG_NAME, conf, properties);
TableIdentifier tableIdentifier = TableIdentifier.of(Constants.DATABASE_NAME, Constants.TABLE_NAME);
TableLoader tableLoader = TableLoader.fromCatalog(catalogLoader, tableIdentifier);
FlinkSink.forRowData(input)
    .writeParallelism(parallelism)
    .tableLoader(tableLoader)
    .upsert(true)
    .overwrite(false)
    .append();

// rewriteDataFiles
StreamExecutionEnvironment env = FlinkEnvironment.getEnvironment(parallelism);
tableLoader.open();
Table table = tableLoader.loadTable();

Actions.forTable(env, table)
    .rewriteDataFiles()
    .maxParallelism(parallelism)
    .targetSizeInBytes(256 * 1024 * 1024)
    .filter(Expressions.equal("news_postdate", newsPostdate))
    .execute();

Is there any other way to solve this problem? Thanks

zhushanwei avatar Aug 02 '22 10:08 zhushanwei

I also hit this problem in the same case. It is not "some delete files associated with the data file" that causes it. Add a log statement at the tail of https://github.com/apache/iceberg/blob/5a15efc070ab59eeda6343998aa065c0c9892c5c/core/src/main/java/org/apache/iceberg/DeleteFileIndex.java#L151 to print the data file path, the delete file path, and the lower and upper bounds. You will see that the bounds recorded for the file path are not the complete path but are truncated to 16 characters, which can produce false positives when Iceberg decides whether a delete file references a data file. From the source at https://github.com/apache/iceberg/blob/5a15efc070ab59eeda6343998aa065c0c9892c5c/core/src/main/java/org/apache/iceberg/MetricsConfig.java#L52 you can see that DEFAULT_WRITE_METRICS_MODE_DEFAULT is truncate(16), so the file_path bounds were truncated when the files were written, which leads to the misjudgment when the rewrite commits. To resolve this, set a table property like this when creating the table:

alter table iceberg_table set tblproperties (
  'write.metadata.metrics.default'='full'
);
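If you would rather set this from the Java side instead of SQL, here is a rough sketch of my own (it assumes `table` is the Iceberg Table loaded via TableLoader, as in the snippets above; TableProperties.DEFAULT_WRITE_METRICS_MODE is the constant for 'write.metadata.metrics.default'):

// Sketch only: switch the default write metrics mode to full via the table API.
import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

table.updateProperties()
    .set(TableProperties.DEFAULT_WRITE_METRICS_MODE, "full")
    .commit();

Either way, the change only applies to files written after the property is set; metrics already recorded for existing data and delete files stay truncated.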

Shane-Yu avatar Aug 12 '22 08:08 Shane-Yu

I also hit this problem in the same case. It is not "some delete files associated with the data file" that causes it. Add a log statement at the tail of

https://github.com/apache/iceberg/blob/5a15efc070ab59eeda6343998aa065c0c9892c5c/core/src/main/java/org/apache/iceberg/DeleteFileIndex.java#L151

to print the data file path, the delete file path, and the lower and upper bounds. You will see that the bounds recorded for the file path are not the complete path but are truncated to 16 characters, which can produce false positives when Iceberg decides whether a delete file references a data file. From the source at https://github.com/apache/iceberg/blob/5a15efc070ab59eeda6343998aa065c0c9892c5c/core/src/main/java/org/apache/iceberg/MetricsConfig.java#L52

you can see that DEFAULT_WRITE_METRICS_MODE_DEFAULT is truncate(16), so the file_path bounds were truncated when the files were written, which leads to the misjudgment when the rewrite commits. To resolve this, set a table property like this when creating the table: alter table iceberg_table set tblproperties ( 'write.etadata.metrics.default'='full' );

Yes, you are right! I'm wondering why I can't reproduce this problem. But the property key is write.metadata.metrics.default

hzluting avatar Aug 12 '22 09:08 hzluting

Is this problem solved? I hit the same error.

chenwyi2 avatar Sep 22 '22 09:09 chenwyi2

Is this problem solved? I hit the same error.

create table properties:

'write.distribution-mode'='hash',
'commit.manifest.min-count-to-merge'='2',
'format-version'='2',
'write.upsert.enable'='true',
'write.metadata.metrics.default'='full',
'write.metadata.delete-after-commit.enabled'='true',
'write.metadata.previous-versions-max'='1'

writer data:

FlinkSink.forRowData(input)
    .writeParallelism(parallelism)
    .tableLoader(tableLoader)
    .overwrite(false)
    .append();

I used these configurations, and the result is normal.
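To double-check what is actually in effect on the table, here is a small sketch of my own (again assuming the Table loaded via TableLoader from the snippets above):

// Sketch only: print the write-related table properties discussed in this thread.
import org.apache.iceberg.Table;

for (String key : new String[] {
    "write.upsert.enable",
    "write.metadata.metrics.default",
    "write.distribution-mode"}) {
  System.out.println(key + " = " + table.properties().get(key));
}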

zhushanwei avatar Oct 17 '22 10:10 zhushanwei

@RussellSpitzer @stevenzwu This looks like a bug. Would it be possible to fix it by defaulting the metrics mode of the file_path column to full?

lintingbin avatar Nov 01 '22 07:11 lintingbin

@Shane-Yu I added 'write.metadata.metrics.default'='full' to the table and printed the log message for the upper bound: "upper java.nio.HeapByteBuffer[pos=0 lim=16 cap=16], fromByteBuffer qbfs://online010". "pos=0 lim=16 cap=16" means it is still truncated to 16 bytes, right? So it doesn't work?
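For reference, this is roughly how that bound can be decoded (a sketch of my own; `bound` stands for a ByteBuffer taken from a delete file's upperBounds() map, not a variable from the code above):

// Sketch only: decode a file_path bound; a 16-byte buffer means it is still the truncate(16) prefix.
import java.nio.ByteBuffer;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Types;

int len = bound.remaining();
CharSequence decoded = Conversions.fromByteBuffer(Types.StringType.get(), bound);
System.out.println("bound bytes=" + len + ", value=" + decoded);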

chenwyi2 avatar Nov 15 '22 10:11 chenwyi2

'write.distribution-mode'='hash', 'commit.manifest.min-count-to-merge'='2', 'format-version'='2', 'write.upsert.enable'='true', 'write.metadata.metrics.default'='full', 'write.metadata.delete-after-commit.enabled'='true', 'write.metadata.previous-versions-max'='1'

Your configuration 'write.upsert.enable'='true' is wrong, so your job is not actually running in upsert mode. Do you have another solution? Anyway, thanks!

skywalker2256 avatar Apr 04 '23 11:04 skywalker2256