Cannot commit, found new delete for replaced data file
Query engine
Flink (iceberg-flink-runtime-1.14-0.14.0.jar). Help me, thanks!
Question
```
org.apache.iceberg.exceptions.ValidationException: Cannot commit, found new delete for replaced data file: GenericDataFile{content=data, file_path=hdfs://dev-001:8020/iceberg/flink_hive_iceberg/flink_hive_db.db/test_repository_1/data/news_postdate=2022-07-31/00002-0-8b3590ea-a593-4734-b84a-a6084a426b95-00093.parquet, file_format=PARQUET, spec_id=0, partition=PartitionData{news_postdate=2022-07-31}, record_count=106, file_size_in_bytes=110049, column_sizes=null, value_counts=null, null_value_counts=null, nan_value_counts=null, lower_bounds=null, upper_bounds=null, key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=null}
	at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:50)
	at org.apache.iceberg.MergingSnapshotProducer.validateNoNewDeletesForDataFiles(MergingSnapshotProducer.java:418)
	at org.apache.iceberg.MergingSnapshotProducer.validateNoNewDeletesForDataFiles(MergingSnapshotProducer.java:367)
	at org.apache.iceberg.BaseRewriteFiles.validate(BaseRewriteFiles.java:108)
	at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:175)
	at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:296)
	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:404)
	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:214)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:198)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:190)
	at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:295)
	at org.apache.iceberg.actions.BaseSnapshotUpdateAction.commit(BaseSnapshotUpdateAction.java:41)
	at org.apache.iceberg.actions.BaseRewriteDataFilesAction.doReplace(BaseRewriteDataFilesAction.java:298)
	at org.apache.iceberg.actions.BaseRewriteDataFilesAction.replaceDataFiles(BaseRewriteDataFilesAction.java:277)
	at org.apache.iceberg.actions.BaseRewriteDataFilesAction.execute(BaseRewriteDataFilesAction.java:252)
```
I used upsert mode.
The reason is that Iceberg found some delete files associated with the data files that you want to rewrite. Can you show me your case?
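To check whether that is the situation, here is a minimal diagnostic sketch (illustrative, not from this thread; it assumes a loaded `table`, and the partition filter value is taken from the stack trace above) that prints how many delete files each data file in the partition carries:

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.io.CloseableIterable;

public class InspectDeletes {
    // Print each data file in the partition and the number of delete files
    // that currently reference it; non-zero counts would explain why the
    // rewrite commit fails validation.
    public static void inspect(Table table) throws Exception {
        try (CloseableIterable<FileScanTask> tasks =
                 table.newScan()
                      .filter(Expressions.equal("news_postdate", "2022-07-31"))
                      .planFiles()) {
            for (FileScanTask task : tasks) {
                System.out.printf("%s -> %d delete file(s)%n",
                    task.file().path(), task.deletes().size());
            }
        }
    }
}
```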
```java
// write data
Configuration conf = new Configuration();
Map<String, String> properties = new HashMap<>();
CatalogLoader catalogLoader = CatalogLoader.hive(Constants.CATALOG_NAME, conf, properties);
TableIdentifier tableIdentifier = TableIdentifier.of(Constants.DATABASE_NAME, Constants.TABLE_NAME);
TableLoader tableLoader = TableLoader.fromCatalog(catalogLoader, tableIdentifier);

FlinkSink.forRowData(input)
    .writeParallelism(parallelism)
    .tableLoader(tableLoader)
    .upsert(true)
    .overwrite(false)
    .append();
```
```java
// rewriteDataFiles
StreamExecutionEnvironment env = FlinkEnvironment.getEnvironment(parallelism);
tableLoader.open();
Table table = tableLoader.loadTable();

Actions.forTable(env, table)
    .rewriteDataFiles()
    .maxParallelism(parallelism)
    .targetSizeInBytes(256 * 1024 * 1024)
    .filter(Expressions.equal("news_postdate", newsPostdate))
    .execute();
```

Is there any other way to solve this problem? Thanks
I also met this problem in the same case. It's not "some delete files associated with the data file" that causes this problem. Add a log statement at the tail of https://github.com/apache/iceberg/blob/5a15efc070ab59eeda6343998aa065c0c9892c5c/core/src/main/java/org/apache/iceberg/DeleteFileIndex.java#L151 to print the data file path, the delete file path, and the lower and upper bounds. You will see that the upper and lower file path info is not the complete file path but is truncated to 16 bytes. This can lead to false positives when determining whether a data file is referenced by a delete file. From the source code https://github.com/apache/iceberg/blob/5a15efc070ab59eeda6343998aa065c0c9892c5c/core/src/main/java/org/apache/iceberg/MetricsConfig.java#L52 you can see that DEFAULT_WRITE_METRICS_MODE_DEFAULT is truncate(16). The upper and lower bounds of the file path were truncated when the data file was generated, which leads to the misjudgment at commit time when rewriting data.

To resolve this problem, add a property like this when creating the table:

```sql
alter table iceberg_table set tblproperties ( 'write.metadata.metrics.default'='full' );
```
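To illustrate the false positive described above, here is a minimal, self-contained sketch (hypothetical paths, not Iceberg internals) showing how truncating file_path bounds to 16 bytes makes distinct files indistinguishable:

```java
// Sketch: with truncate(16) metrics, the lower/upper bounds stored for
// file_path keep only a 16-byte prefix, so distinct paths that share a
// prefix collide in range comparisons.
public class TruncateFalsePositive {
    public static void main(String[] args) {
        String dataFile  = "hdfs://dev-001:8020/iceberg/db/t/data/00001-abc.parquet";
        String otherFile = "hdfs://dev-001:8020/iceberg/db/t/data/00002-def.parquet";

        // Bounds as written under the default truncate(16) metrics mode
        String bound1 = dataFile.substring(0, 16);  // "hdfs://dev-001:8"
        String bound2 = otherFile.substring(0, 16); // "hdfs://dev-001:8"

        // The truncated bounds are equal, so a range check cannot rule out
        // that a delete file references dataFile: a false positive, and the
        // rewrite commit conservatively fails validation.
        System.out.println(bound1.equals(bound2)); // true
    }
}
```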
> alter table iceberg_table set tblproperties ( 'write.etadata.metrics.default'='full' );
Yes, you are right! I'm wondering why I can't reproduce this problem. But the property key is write.metadata.metrics.default
Is this problem solved? I met the same error.
create table:

```
'write.distribution-mode'='hash',
'commit.manifest.min-count-to-merge'='2',
'format-version'='2',
'write.upsert.enable'='true',
'write.metadata.metrics.default'='full',
'write.metadata.delete-after-commit.enabled'='true',
'write.metadata.previous-versions-max'='1'
```
write data:

```java
FlinkSink.forRowData(input)
    .writeParallelism(parallelism)
    .tableLoader(tableLoader)
    .overwrite(false)
    .append();
```
I used these configurations, and the result is normal.
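For reference, a minimal sketch (with a hypothetical schema; the catalog, database, and table names are illustrative) of applying the properties listed above when creating the table through the Iceberg API:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.types.Types;

public class CreateTableWithProps {
    public static void createTable(Catalog catalog) {
        // Hypothetical schema; replace with your own columns.
        Schema schema = new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional(2, "news_postdate", Types.StringType.get()));

        // The properties from the comment above.
        Map<String, String> props = new HashMap<>();
        props.put("write.distribution-mode", "hash");
        props.put("commit.manifest.min-count-to-merge", "2");
        props.put("format-version", "2");
        props.put("write.upsert.enable", "true");
        props.put("write.metadata.metrics.default", "full");
        props.put("write.metadata.delete-after-commit.enabled", "true");
        props.put("write.metadata.previous-versions-max", "1");

        catalog.createTable(
            TableIdentifier.of("flink_hive_db", "test_repository_1"),
            schema,
            PartitionSpec.builderFor(schema).identity("news_postdate").build(),
            props);
    }
}
```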
@RussellSpitzer @stevenzwu This should be a bug. Is it possible to solve it by setting the metrics of the file_path column to full by default?
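If per-column overrides apply here, a narrower fix than full metrics for every column might look like the sketch below. This is an assumption: 'write.metadata.metrics.column.&lt;name&gt;' is Iceberg's documented per-column metrics pattern, but whether it covers the file_path column written into delete-file metadata is untested here.

```java
// Assumption: the per-column metrics override also applies to the
// file_path column of delete-file metadata; untested.
table.updateProperties()
    .set("write.metadata.metrics.column.file_path", "full")
    .commit();
```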
@Shane-Yu I added 'write.metadata.metrics.default'='full' to the table, and I printed the log message with the upper bound: "upper java.nio.HeapByteBuffer[pos=0 lim=16 cap=16], fromByteBuffer qbfs://online010". "pos=0 lim=16 cap=16" means it is still truncated to 16 bytes? It doesn't work?
> 'write.distribution-mode'='hash', 'commit.manifest.min-count-to-merge'='2', 'format-version'='2', 'write.upsert.enable'='true', 'write.metadata.metrics.default'='full', 'write.metadata.delete-after-commit.enabled'='true', 'write.metadata.previous-versions-max'='1'
Your configuration 'write.upsert.enable'='true' is wrong, so your mode is not actually upsert-enabled. Do you have another solution? Anyway, thanks!