rewrite_table_path throws "AlreadyExistsException: Location already exists" on rewriting positional deletes
Apache Iceberg version
1.10.0 (latest release)
Query engine
Spark
Please describe the bug 🐞
I'm running Iceberg 1.10.0-amzn-0, the AWS-specific build that ships with emr-7.12.0 and Spark 3.5.6.
I suspect this isn't an AWS-specific bug, but if my suspicion is wrong, feel free to close this issue.
For tables without positional deletes the procedure finishes successfully.
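(For reference, I'm distinguishing the two cases with a query along these lines against the delete_files metadata table, where content = 1 marks position deletes; this is just a sketch, with the table name from above:)

SELECT content, count(*) AS files
FROM glue.source_name.table_name.delete_files
GROUP BY content;
-- content = 1: position deletes, content = 2: equality deletes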
For tables with positional deletes, however, when I run the rewrite_table_path procedure like so:
CALL glue.system.rewrite_table_path(
  table => 'glue.source_name.table_name',
  source_prefix => 's3://bucket/iceberg/source_name/table_name',
  target_prefix => 'target_prefix_test/iceberg/source_name/table_name',
  staging_location => 's3://bucket/iceberg_rewrite_test/source_name/table_name'
)
I run into this error:
org.apache.iceberg.exceptions.AlreadyExistsException: Location already exists: s3://bucket/iceberg_rewrite_test/source_name/table_name/data/timestamp_month=2025-08/00000-190-b955afb8-8873-4d92-8546-c7be85bbccda-00002-deletes.parquet
at org.apache.iceberg.aws.s3.S3OutputFile.create(S3OutputFile.java:107)
at org.apache.iceberg.parquet.ParquetIO$ParquetOutputFile.create(ParquetIO.java:149)
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:473)
at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:431)
at org.apache.iceberg.parquet.ParquetWriter.ensureWriterInitialized(ParquetWriter.java:114)
at org.apache.iceberg.parquet.ParquetWriter.flushRowGroup(ParquetWriter.java:214)
at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:258)
at org.apache.iceberg.deletes.PositionDeleteWriter.close(PositionDeleteWriter.java:92)
at org.apache.iceberg.RewriteTablePathUtil.rewritePositionDeleteFile(RewriteTablePathUtil.java:633)
at org.apache.iceberg.spark.actions.RewriteTablePathSparkAction.lambda$rewritePositionDelete$a4760a1f$1(RewriteTablePathSparkAction.java:673)
at org.apache.spark.sql.Dataset.$anonfun$foreach$2(Dataset.scala:3553)
at org.apache.spark.sql.Dataset.$anonfun$foreach$2$adapted(Dataset.scala:3553)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1047)
at org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1047)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2545)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:174)
at org.apache.spark.scheduler.Task.run(Task.scala:152)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:632)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:96)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:635)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
The specific file it fails on appears to be random. This looks like a race condition in which the same positional delete file is rewritten by multiple tasks; see the query sketched below.
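As far as I understand, rewrite_table_path rewrites the files referenced by every retained snapshot, so my guess is that a delete file referenced from more than one snapshot gets scheduled for rewriting more than once. A query along these lines against the all_delete_files metadata table (a sketch, same table as above) should show whether such shared delete files exist:

SELECT file_path, count(*) AS refs
FROM glue.source_name.table_name.all_delete_files
GROUP BY file_path
HAVING count(*) > 1;
-- any row here is a delete file referenced by more than one snapshot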
I will attempt to create a tiny table for which this problem occurs, to make it reproducible; a sketch of what I plan to try follows.
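This is only a starting point, assuming a format-version 2 table with merge-on-read deletes; the catalog, schema, and bucket names below are placeholders:

CREATE TABLE glue.tmp.rtp_repro (id BIGINT, ts TIMESTAMP)
USING iceberg
PARTITIONED BY (months(ts))
TBLPROPERTIES (
  'format-version' = '2',
  'write.delete.mode' = 'merge-on-read'
);

INSERT INTO glue.tmp.rtp_repro VALUES
  (1, TIMESTAMP '2025-08-01 00:00:00'),
  (2, TIMESTAMP '2025-08-02 00:00:00');

-- generates a positional delete file
DELETE FROM glue.tmp.rtp_repro WHERE id = 1;

-- extra commits so that several snapshots reference the same delete file
INSERT INTO glue.tmp.rtp_repro VALUES (3, TIMESTAMP '2025-08-03 00:00:00');
INSERT INTO glue.tmp.rtp_repro VALUES (4, TIMESTAMP '2025-08-04 00:00:00');

CALL glue.system.rewrite_table_path(
  table => 'glue.tmp.rtp_repro',
  source_prefix => 's3://bucket/iceberg/tmp/rtp_repro',
  target_prefix => 's3://other-bucket/iceberg/tmp/rtp_repro',
  staging_location => 's3://bucket/iceberg_rewrite_test/tmp/rtp_repro'
);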
Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time