[Bug]: GC causing "org.apache.iceberg.exceptions.NotFoundException: File does not exist"
What happened
After running GC I started getting a "file does not exist" error. It looks like the file was deleted from storage, but its reference was not removed from the table metadata.
https://github.com/apache/iceberg/issues/8338
How to reproduce it
- Create a table
- Add data
- Run GC with the Nessie GC tool's gc command
- Try to read all data from the table again (a minimal sketch of these steps follows below)
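A minimal sketch of those steps, assuming a Spark session that is already configured with a Nessie-backed Iceberg catalog named nessie; the namespace and table names are placeholders, and the GC step runs out-of-band with the CLI shown further down in this thread:

```python
# Hypothetical reproducer sketch; catalog, namespace and table names are placeholders.
# Assumes `spark` is a SparkSession configured with a Nessie-backed Iceberg catalog "nessie".

# 1. Create a table
spark.sql("CREATE TABLE nessie.db.demo (id BIGINT, payload STRING) USING iceberg")

# 2. Add data (a few separate commits)
for i in range(3):
    spark.sql(f"INSERT INTO nessie.db.demo VALUES ({i}, 'row-{i}')")

# 3. Run GC out-of-band with the Nessie GC CLI, e.g.:
#      java -jar nessie-gc-0.99.0.jar gc
#    (plus whatever connection options your deployment needs)

# 4. Read all data again -- this is where the NotFoundException shows up
spark.sql("SELECT * FROM nessie.db.demo").show()
```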
Nessie server type (docker/uber-jar/built from source) and version
kubernetes 0.99.0
Client type (Ex: UI/Spark/pynessie ...) and version
Spark
Additional information
No response
Maybe this is related as well https://github.com/projectnessie/nessie/issues/8263
@dorsegal can you provide more details, or better a reproducer?
My setup uses Kafka Iceberg Connect: create a table and use the Kafka Iceberg sink to write data into it. After several commits, run GC and then try to rewrite the data files using Spark:
spark.sql( "CALL nessie.system.rewrite_data_files(table => 'table', options => map('partial-progress.enabled','true', 'max-concurrent-file-group-rewrites', '30'))" ).show()
I can provide more logs if needed, I just don't know which ones would help. From the GC logs I can see that it deleted some files.
What I meant is a full reproducer mentioning every step starting from scratch, so that someone can get to the same behavior on a "clean"/empty environment.
java -jar nessie-gc-0.99.0.jar gc
failed with:
Caused by: java.lang.RuntimeException: Failed to get manifest files for ICEBERG_TABLE robot_dev.robot_data, content-ID fc122060-bf21-44d3-b776-fbecb2d23715 at commit 47a35e9867b0408c65feb09d7140a29d198354edf3a1aa0dc7cc09d192b07c27 via s3://ice-lake/robot_dev/robot_data/metadata/00000-6b65b94d-6370-4c43-9baa-b40ed0770c5d.metadata.json
After expiring snapshots in Spark SQL:
CALL nessie.system.expire_snapshots('nessie.robot_dev.robot_data', TIMESTAMP '2024-10-15 00:00:00.000', 1)
The count of snapshots was reduced and manifest files have been deleted, but the Nessie metadata may not have been synced with those snapshot changes.
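For reference, the snapshots that remain after expire_snapshots can be inspected through Iceberg's standard snapshots metadata table (the table name here is the one from the call above):

```python
# Check which snapshots are still referenced after expire_snapshots;
# assumes `spark` is the same session used for the CALL above.
spark.sql(
    "SELECT snapshot_id, committed_at FROM nessie.robot_dev.robot_data.snapshots"
).show()
```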
@snazy
@yunlou11 thanks for the information. But what are all the necessary steps to get to that error message? Aka, what was all done before running GC?
Sorry, it's maybe an Iceberg error, not Nessie:
spark.sql("call demo.system.remove_orphan_files(table => 'xxx').show()")
Related Iceberg PR: https://github.com/apache/iceberg/pull/7914
When I fixed the Iceberg remove_orphan_files S3 exception by following the suggestion "Use SupportsPrefixOperations for Remove OrphanFile Procedure", I got the nessie-gc failure, but that is a bug in the patched Iceberg code, not in Nessie.
Was this ever resolved? @dorsegal how did you fix the missing file reference?
@justas200 I didn't manage to fix it. I had to recreate the Nessie catalog. It was painful. I am too afraid to test it again as I don't want to lose data.
I didn't have the time to create a test environment to reproduce it and write it up here. But when it happened to me, this is what I did: create an Iceberg table using Spark, insert data into the Iceberg table, run GC with the GC command, then use Spark to read ALL the data.
Hi, any update on this? It seems to happen quite frequently in the following scenario:
- Kafka to Iceberg (with Kafka connector)
- Periodically call to nessie.system.rewrite_data_files(...)
- Then call nessie-gc
I have the same setup. Never used the GC ever since...
@dorsegal FYI, it looks like adding the max-file-modification parameter to the GC call may also solve your problem. From what I was able to understand, files that were still being written while GC was running were marked as deletable even though they should not have been:
--max-file-modification="$(date -d '1 hours ago' +"%Y-%m-%dT%H:%M:%SZ")"
see also https://github.com/projectnessie/nessie/issues/10046
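Putting the two commands from this thread together, a sketch of what such a GC invocation could look like (the one-hour cutoff is only an example, and any connection options your deployment needs are omitted):

```python
# Sketch: run Nessie GC with a modification-time cutoff so that files still being
# written while GC runs are not treated as deletable. Uses the nessie-gc jar
# mentioned earlier in this thread; connection options are omitted.
import subprocess
from datetime import datetime, timedelta, timezone

cutoff = (datetime.now(timezone.utc) - timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%SZ")

subprocess.run(
    ["java", "-jar", "nessie-gc-0.99.0.jar", "gc",
     f"--max-file-modification={cutoff}"],
    check=True,
)
```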
In case anyone comes across this issue and has lost actual Parquet (or ORC, etc.) files: what I did was create a table with the same name and schema in the dev environment (another schema should also work) and insert a single row with all null/placeholder values into that table (it is important that they don't match anything in the table you're fixing). Then, each time the query failed, I copied that Parquet file from the dev environment to the path the error was complaining about (e.g. s3a://iceberg/logs/events_b7bb2c4b-c131-4745-b5cd-a7a71d305846/data/0000/0101/0011/10100000/ts_day=2025-05-16/kafka_partition=4/00001-1747379360621-4f80847e-ee2d-47b5-afc6-cd220fd1d36a-00002.parquet) until it didn't fail anymore. Then I ran a delete on the null/placeholder values I had used.

This allowed me to get past the query failures, albeit with data loss. Avoiding the data loss would have been slightly more complex but possible: launch a new Kafka Connect consumer group for the topic that points at a temporary table, then use that temporary table to overwrite the partitions where there was data loss. It seems like that would have been necessary had I lost metadata files instead of data files.
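A rough sketch of that workaround in PySpark plus boto3; every catalog, table, bucket and key name below is a placeholder (take the destination key from the error message you actually see), and the sentinel values are hypothetical:

```python
# Hypothetical sketch of the placeholder-file workaround described above;
# all catalog, table, bucket and key names are placeholders.
import boto3

# 1. In another namespace, create a table with exactly the same schema as the broken
#    one and insert a single sentinel row whose values never occur in the real data.
spark.sql("CREATE TABLE nessie.dev.events_fix (id BIGINT, payload STRING) USING iceberg")
spark.sql("INSERT INTO nessie.dev.events_fix VALUES (-1, NULL)")

# 2. Find the parquet file that insert produced via Iceberg's 'files' metadata table.
src_path = spark.sql("SELECT file_path FROM nessie.dev.events_fix.files").first()[0]
src_bucket, src_key = src_path.split("/", 3)[2], src_path.split("/", 3)[3]

# 3. Copy that file to the path the failing query complains about; repeat for every
#    missing file that gets reported.
s3 = boto3.client("s3")
s3.copy_object(
    Bucket="iceberg",
    Key="logs/.../missing-file.parquet",  # destination key taken from the error message
    CopySource={"Bucket": src_bucket, "Key": src_key},
)

# 4. Once reads succeed again, delete the sentinel rows from the broken table.
spark.sql("DELETE FROM nessie.logs.events WHERE id = -1 AND payload IS NULL")
```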