nessie `ValidationException: Cannot set main to unknown snapshot` occurs when inserting with high concurrency?

Open sxh-lsc opened this issue 10 months ago • 6 comments

Issue description

16:39:33.385 ERROR o.a.s.s.e.d.v2.AppendDataExec : Data source write support IcebergBatchWrite(table=xxxxxxxx, format=PARQUET) aborted.
org.apache.spark.SparkException: Writing job aborted
	at org.apache.spark.sql.errors.QueryExecutionErrors$.writingJobAbortedError(QueryExecutionErrors.scala:767)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:409)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:353)
	at org.apache.spark.sql.execution.datasources.v2.AppendDataExec.writeWithV2(WriteToDataSourceV2Exec.scala:244)
	at org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run(WriteToDataSourceV2Exec.scala:332)
	at org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run$(WriteToDataSourceV2Exec.scala:331)
	at org.apache.spark.sql.execution.datasources.v2.AppendDataExec.run(WriteToDataSourceV2Exec.scala:244)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:116)
	at org.apache.spark.sql.DataFrameWriterV2.runCommand(DataFrameWriterV2.scala:195)
	at org.apache.spark.sql.DataFrameWriterV2.append(DataFrameWriterV2.scala:149)
	at ai.weride.datalake_service.write_center.Insert$.$anonfun$writeToIceberg$3(Insert.scala:49)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:967)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1024)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:890)
	at zio.internal.FiberRuntime.evaluateEffect(FiberRuntime.scala:381)
	at zio.internal.FiberRuntime.evaluateMessageWhileSuspended(FiberRuntime.scala:504)
	at zio.internal.FiberRuntime.drainQueueOnCurrentThread(FiberRuntime.scala:220)
	at zio.internal.FiberRuntime.run(FiberRuntime.scala:139)
	at zio.internal.ZScheduler$$anon$4.run(ZScheduler.scala:478)
Caused by: org.apache.iceberg.exceptions.ValidationException: Cannot set main to unknown snapshot: 5891649337483373908
	at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49)
	at org.apache.iceberg.TableMetadata$Builder.setBranchSnapshot(TableMetadata.java:1165)
	at org.apache.iceberg.nessie.NessieTableOperations.loadTableMetadata(NessieTableOperations.java:104)
	at org.apache.iceberg.nessie.NessieTableOperations.lambda$doRefresh$1(NessieTableOperations.java:149)
	at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$1(BaseMetastoreTableOperations.java:208)
	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
	at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:208)
	at org.apache.iceberg.nessie.NessieTableOperations.doRefresh(NessieTableOperations.java:149)
	at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
	at org.apache.iceberg.SnapshotProducer.refresh(SnapshotProducer.java:345)
	at org.apache.iceberg.SnapshotProducer.apply(SnapshotProducer.java:210)
	at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:366)
	at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
	at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
	at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
	at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:364)
	at org.apache.iceberg.spark.source.SparkWrite.commitOperation(SparkWrite.java:216)
	at org.apache.iceberg.spark.source.SparkWrite.access$1300(SparkWrite.java:83)
	at org.apache.iceberg.spark.source.SparkWrite$BatchAppend.commit(SparkWrite.java:279)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:392)
	... 47 more

This occurs in some high-concurrency insert scenarios. In most cases Iceberg's commit retry (the `commit.retry.*` table properties) retries until the commit succeeds, but under high concurrency there are still cases where the write fails even with retries enabled.
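
For reference, here is a minimal sketch of how the commit-retry settings mentioned above are typically set on an Iceberg table from Spark. The property names are standard Iceberg table properties. The table name `nessie.db.events`, the object name `CommitRetrySettings`, and the values are hypothetical and are not taken from this report. These properties govern retries of commit conflicts; whether tuning them avoids the `ValidationException` shown in the trace is not established here.

```scala
import org.apache.spark.sql.SparkSession

object CommitRetrySettings {
  // Iceberg commit-retry table properties.
  // The values are illustrative assumptions, not recommendations.
  val retryProps: Map[String, String] = Map(
    "commit.retry.num-retries" -> "10",
    "commit.retry.min-wait-ms" -> "200",
    "commit.retry.max-wait-ms" -> "60000"
  )

  // Render an ALTER TABLE statement that sets the given properties.
  def alterTableSql(table: String, props: Map[String, String]): String = {
    val rendered = props.toSeq
      .sortBy(_._1)
      .map { case (k, v) => s"'$k'='$v'" }
      .mkString(", ")
    s"ALTER TABLE $table SET TBLPROPERTIES ($rendered)"
  }

  // Apply the properties through an existing SparkSession.
  // `table` is a hypothetical identifier such as "nessie.db.events".
  def apply(spark: SparkSession, table: String): Unit =
    spark.sql(alterTableSql(table, retryProps))
}
```

Calling `CommitRetrySettings(spark, "nessie.db.events")` once per table is enough, since the properties are stored in the table metadata rather than in the Spark session.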

Apr 02 '24 03:04 sxh-lsc
