[Bug]: S3 paths with special characters are not escaped
What happened
Maybe related to https://github.com/projectnessie/nessie/issues/8524
In our deployment a user created a namespace with a question mark in its name. Since then we cannot run gc anymore. I managed to reproduce it in our tests. GC fails with:
Caused by: java.lang.RuntimeException: Failed to get manifest files for ICEBERG_TABLE tests.a.C?mpl Name .table, content-ID 0749c4e9-ac98-4c82-ad2f-1bc387f5045f at commit 4696d1290522386c8ee4db51c64c1fd3ed567351a0026a5fefb6154faa1629d2 via s3://wyrm-lakehouse-pytest/tests.a.C?mpl Name /table_2559f350-e7c7-4216-b29e-5033091a4d06/metadata/00000-6cc479b6-0667-4c01-93e8-d59ed46c4c4c.metadata.json
...
Caused by: org.apache.iceberg.exceptions.RuntimeIOException: Failed to open file: s3://wyrm-lakehouse-pytest/tests.a.C?mpl Name /table_2559f350-e7c7-4216-b29e-5033091a4d06/metadata/snap-6014480353781191283-1-fe723339-fd23-4db3-a534-95f1de851af9.avro
The file exists. It seems to me that the ? is not escaped, and thus the rest of the path is interpreted as a query argument.
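For illustration, here is a minimal Java sketch of that hypothesis (the path is a made-up stand-in, with the space omitted so the string parses at all; whether the failing Iceberg/S3 code path actually goes through java.net.URI is not confirmed):

import java.net.URI;

public class UnescapedQuestionMark {
  public static void main(String[] args) {
    // Hypothetical S3-style location containing an unescaped "?".
    URI uri = URI.create("s3://my-bucket/a.C?mplName/table/metadata/snap.avro");
    // Everything after the "?" is parsed as the query component, not the path:
    System.out.println(uri.getPath());   // prints: /a.C
    System.out.println(uri.getQuery());  // prints: mplName/table/metadata/snap.avro
  }
}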
Is there a flag to make gc more robust overall? I don't want it to abort halfway through due to a single table failure.
How to reproduce it
- Create a namespace "a" with a subnamespace "C?mpl3 Name " (I used Spark; a SQL sketch of these steps follows after this list)
- Create a table in it called "table"
- Drop the table
- Run Nessie gc
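A minimal Spark SQL sketch of the steps above (assuming a Nessie catalog registered as nessie; the exact catalog configuration is not shown in the report, and the table column is illustrative):

CREATE NAMESPACE nessie.a;
CREATE NAMESPACE nessie.a.`C?mpl3 Name `;
CREATE TABLE nessie.a.`C?mpl3 Name `.`table` (n STRING);
DROP TABLE nessie.a.`C?mpl3 Name `.`table`;
-- then run the Nessie gc tool against the repository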
Nessie server type (docker/uber-jar/built from source) and version
docker
Client type (Ex: UI/Spark/pynessie ...) and version
None - Nessie gc
Additional information
Nessie 0.99.0
Sorry, @c-thiel, I was not able to reproduce this problem with the current main code (local build) :thinking:
Could you share your full exception stack trace and exact gc CLI arguments (minus secret stuff, of course)?
Hm, strange.
Here is the call stack:
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.RuntimeException: Failed to get manifest files for ICEBERG_TABLE tests.2023-10-13-101415-edd293e4-1eee-47d7-b964-171340e13ae9.a.C?mpl Name .table, content-ID 0749c4e9-ac98-4c82-ad2f-1bc387f5045f at commit f150347568bc34bfd27b611d6588a00bf472e7bd76c477b3b96864deddf3ed5d via s3://my-bucket/tests.2023-10-13-101415-edd293e4-1eee-47d7-b964-171340e13ae9.a.C?mpl Name /table_2559f350-e7c7-4216-b29e-5033091a4d06/metadata/00000-6cc479b6-0667-4c01-93e8-d59ed46c4c4c.metadata.json
at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:540)
at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
at java.base/java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:653)
at java.base/java.util.concurrent.ForkJoinPool.invoke(ForkJoinPool.java:2822)
at org.projectnessie.gc.expire.local.DefaultLocalExpire.expire(DefaultLocalExpire.java:75)
at org.projectnessie.gc.tool.cli.commands.BaseRepositoryCommand.expire(BaseRepositoryCommand.java:250)
at org.projectnessie.gc.tool.cli.commands.MarkAndSweep.call(MarkAndSweep.java:57)
at org.projectnessie.gc.tool.cli.commands.BaseRepositoryCommand.call(BaseRepositoryCommand.java:84)
at org.projectnessie.gc.tool.cli.commands.BaseCommand.call(BaseCommand.java:26)
at org.projectnessie.gc.tool.cli.commands.BaseCommand.call(BaseCommand.java:21)
at picocli.CommandLine.executeUserObject(CommandLine.java:2045)
at picocli.CommandLine.access$1500(CommandLine.java:148)
at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2465)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2457)
at picocli.CommandLine$RunLast.handle(CommandLine.java:2419)
at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2277)
at picocli.CommandLine$RunLast.execute(CommandLine.java:2421)
at picocli.CommandLine.execute(CommandLine.java:2174)
at org.projectnessie.gc.tool.cli.CLI.runMain(CLI.java:104)
at org.projectnessie.gc.tool.cli.CLI.runMain(CLI.java:75)
at org.projectnessie.gc.tool.cli.CLI.main(CLI.java:70)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Failed to get manifest files for ICEBERG_TABLE tests.2023-10-13-101415-edd293e4-1eee-47d7-b964-171340e13ae9.a.C?mpl Name .table, content-ID 0749c4e9-ac98-4c82-ad2f-1bc387f5045f at commit f150347568bc34bfd27b611d6588a00bf472e7bd76c477b3b96864deddf3ed5d via s3://my-bucket/tests.2023-10-13-101415-edd293e4-1eee-47d7-b964-171340e13ae9.a.C?mpl Name /table_2559f350-e7c7-4216-b29e-5033091a4d06/metadata/00000-6cc479b6-0667-4c01-93e8-d59ed46c4c4c.metadata.json
at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
at java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:540)
at java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
at java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:670)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:927)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.base/java.util.stream.ReferencePipeline.reduce(ReferencePipeline.java:657)
at org.projectnessie.gc.expire.local.DefaultLocalExpire.expireInForkJoinPool(DefaultLocalExpire.java:96)
at java.base/java.util.concurrent.ForkJoinTask$AdaptedCallable.exec(ForkJoinTask.java:1456)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
Caused by: java.lang.RuntimeException: Failed to get manifest files for ICEBERG_TABLE tests.2023-10-13-101415-edd293e4-1eee-47d7-b964-171340e13ae9.a.C?mpl Name .table, content-ID 0749c4e9-ac98-4c82-ad2f-1bc387f5045f at commit f150347568bc34bfd27b611d6588a00bf472e7bd76c477b3b96864deddf3ed5d via s3://my-bucket/tests.2023-10-13-101415-edd293e4-1eee-47d7-b964-171340e13ae9.a.C?mpl Name /table_2559f350-e7c7-4216-b29e-5033091a4d06/metadata/00000-6cc479b6-0667-4c01-93e8-d59ed46c4c4c.metadata.json
at org.projectnessie.gc.iceberg.IcebergContentToFiles.lambda$extractTableFiles$0(IcebergContentToFiles.java:170)
at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:273)
at java.base/java.util.stream.Streams$StreamBuilderImpl.forEachRemaining(Streams.java:411)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:310)
at java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:735)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
at java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:276)
at java.base/java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1715)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:921)
at java.base/java.util.stream.ReduceOps$5.evaluateSequential(ReduceOps.java:258)
at java.base/java.util.stream.ReduceOps$5.evaluateSequential(ReduceOps.java:248)
at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.base/java.util.stream.ReferencePipeline.count(ReferencePipeline.java:709)
at org.projectnessie.gc.expire.PerContentDeleteExpired.identifyLiveFiles(PerContentDeleteExpired.java:136)
at org.projectnessie.gc.expire.PerContentDeleteExpired.expire(PerContentDeleteExpired.java:73)
at org.projectnessie.gc.expire.local.DefaultLocalExpire.expireSingleContent(DefaultLocalExpire.java:109)
at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
at java.base/java.util.concurrent.ConcurrentHashMap$KeySpliterator.forEachRemaining(ConcurrentHashMap.java:3573)
at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:960)
at java.base/java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:934)
at java.base/java.util.stream.AbstractTask.compute(AbstractTask.java:327)
at java.base/java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
... 5 more
Caused by: org.apache.iceberg.exceptions.RuntimeIOException: Failed to open file: s3://my-bucket/tests.2023-10-13-101415-edd293e4-1eee-47d7-b964-171340e13ae9.a.C?mpl Name /table_2559f350-e7c7-4216-b29e-5033091a4d06/metadata/snap-6014480353781191283-1-fe723339-fd23-4db3-a534-95f1de851af9.avro
at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:104)
at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:77)
at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:37)
at org.apache.iceberg.relocated.com.google.common.collect.Iterables.addAll(Iterables.java:333)
at org.apache.iceberg.relocated.com.google.common.collect.Lists.newLinkedList(Lists.java:242)
at org.apache.iceberg.ManifestLists.read(ManifestLists.java:45)
at org.apache.iceberg.BaseSnapshot.cacheManifests(BaseSnapshot.java:146)
at org.apache.iceberg.BaseSnapshot.allManifests(BaseSnapshot.java:164)
at org.projectnessie.gc.iceberg.IcebergContentToFiles.allManifests(IcebergContentToFiles.java:268)
at org.projectnessie.gc.iceberg.IcebergContentToFiles.allManifestsAndDataFiles(IcebergContentToFiles.java:254)
at org.projectnessie.gc.iceberg.IcebergContentToFiles.lambda$extractTableFiles$0(IcebergContentToFiles.java:155)
... 37 more
Caused by: org.apache.avro.InvalidAvroMagicException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:78)
at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:102)
... 47 more
The gc command is (see the assembled invocation after this list):
- gc
- --uri
- http://....:19120/api/v2
- --inmemory
- --default-cutoff
- PT80H
- --iceberg
- $(S3_CREDENTIALS)
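For reference, the same arguments assembled into a single invocation of the standalone gc tool (the jar name, host, and credentials placeholder are illustrative, not taken from the report):

java -jar nessie-gc.jar gc \
  --uri http://nessie.example.com:19120/api/v2 \
  --inmemory \
  --default-cutoff PT80H \
  --iceberg "$S3_CREDENTIALS"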
The credential has access to the file. I am on version 0.99; if main doesn't have the problem, we could simply try an update and see whether the problem persists.
@c-thiel: From the stack trace it looks like it is not a problem with S3 URI handling, but it might be a problem with the file contents. Could you double-check via the S3 UI/CLI whether that Avro file is correct?
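For example, here is a minimal (hypothetical) Java check of the Avro magic bytes after downloading the object: per the Avro spec, a valid Avro data file starts with the four bytes 'O', 'b', 'j', 0x01, which is exactly what the InvalidAvroMagicException ("Not an Avro data file") complains about:

import java.io.FileInputStream;
import java.io.IOException;

public class AvroMagicCheck {
  public static void main(String[] args) throws IOException {
    // args[0] is the local path of the downloaded .avro file.
    try (FileInputStream in = new FileInputStream(args[0])) {
      byte[] magic = new byte[4];
      int read = in.read(magic);
      boolean ok = read == 4
          && magic[0] == 'O' && magic[1] == 'b'
          && magic[2] == 'j' && magic[3] == 1;
      System.out.println(ok ? "Avro magic present" : "Not an Avro data file");
    }
  }
}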
I tried Spark + AWS and created a top-level namespace with a "?" and a space in the name.
Then Iceberg fails to insert on the Spark side.
spark-sql ()> create namespace `ns?test abc`;
then
use `ns?test abc`
then
spark-sql (`ns?test abc`)> create table t2 (n string)
> ;
Time taken: 1.609 seconds
spark-sql (`ns?test abc`)> insert into t2 values ('t2');
25/01/08 10:57:35 WARN SnapshotProducer: Failed to load committed table metadata or during cleanup, skipping further cleanup
org.apache.iceberg.exceptions.RuntimeIOException: Failed to open file: s3://EXAMPLE/ns?test abc/t2_11c36b10-a8c1-477a-8a91-b90733dad13f/metadata/snap-4983815506953660385-1-b650452a-4aba-45ae-b417-c9a3adc99462.avro
at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:104)
at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:77)
at org.apache.iceberg.avro.AvroIterable.iterator(AvroIterable.java:37)
at org.apache.iceberg.relocated.com.google.common.collect.Iterables.addAll(Iterables.java:333)
at org.apache.iceberg.relocated.com.google.common.collect.Lists.newLinkedList(Lists.java:258)
at org.apache.iceberg.ManifestLists.read(ManifestLists.java:45)
at org.apache.iceberg.BaseSnapshot.cacheManifests(BaseSnapshot.java:146)
at org.apache.iceberg.BaseSnapshot.allManifests(BaseSnapshot.java:164)
at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:451)
at org.apache.iceberg.spark.source.SparkWrite.commitOperation(SparkWrite.java:233)
at org.apache.iceberg.spark.source.SparkWrite$BatchAppend.commit(SparkWrite.java:301)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:399)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:359)
at org.apache.spark.sql.execution.datasources.v2.AppendDataExec.writeWithV2(WriteToDataSourceV2Exec.scala:225)
at org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run(WriteToDataSourceV2Exec.scala:337)
at org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run$(WriteToDataSourceV2Exec.scala:336)
at org.apache.spark.sql.execution.datasources.v2.AppendDataExec.run(WriteToDataSourceV2Exec.scala:225)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:220)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
at org.apache.spark.sql.SparkSession.$anonfun$sql$4(SparkSession.scala:691)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:713)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:744)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:651)
at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:68)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:501)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:619)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:613)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:613)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:310)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:75)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:52)
at java.base/java.lang.reflect.Method.invoke(Method.java:580)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1029)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:194)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:217)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1120)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1129)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.iceberg.shaded.org.apache.avro.InvalidAvroMagicException: Not an Avro data file
at org.apache.iceberg.shaded.org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:79)
at org.apache.iceberg.avro.AvroIterable.newFileReader(AvroIterable.java:102)
... 74 more
Time taken: 2.957 seconds
So, my guess is that it's a bug in Iceberg code :shrug:
@c-thiel : Does this use case work with a JDBC Iceberg catalog?
I haven't actually tried. Thanks for investigating, @dimas-b.
I don't really care much about this specific use case, but if someone manages to create a broken table (a very small fraction of tables), it would be great if Nessie would continue with gc instead of aborting. I am not really sure how I can get gc to run normally again without touching MongoDB.
Nessie GC uses Iceberg jars for metadata access very similarly to how Spark uses them.
So it is conceivable that, for example, Python might be able to write something that Iceberg Java would not be able to read :shrug: