
[SUPPORT] Rollback failed because the marker file could not be created: it already existed.

Open LmrZER0 opened this issue 1 year ago • 3 comments

Tips before filing an issue

  • Have you gone through our FAQs?

  • Join the mailing list to engage in conversations and get faster support at [email protected].

  • If you have triaged this as a bug, then file an issue directly.

Describe the problem you faced

A clear and concise description of the problem.

To Reproduce

Steps to reproduce the behavior:

  1. My config:
       hoodie.write.concurrency.mode=optimistic_concurrency_control
       hoodie.cleaner.policy.failed.writes=LAZY
       hoodie.write.concurrency.early.conflict.detection.enable=TRUE
  2. The job was not restarted.

Expected behavior

A clear and concise description of what you expected to happen.

[image]

2024-08-13 11:06:01.598 ERROR [pool-258-thread-1:8-thread-1] org.apache.hudi.async.HoodieAsyncService - Service shutdown with error
java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieRollbackException: Failed to rollback hdfs://ns1200/user/test/tmp.db/app_jdr_ads_dra_edm_user_behavior_content_hudi_a_d_d commits 20240811184332421
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
    at org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103)
    at org.apache.hudi.async.AsyncCleanerService.waitForCompletion(AsyncCleanerService.java:75)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.asyncClean(BaseHoodieTableServiceClient.java:132)
    at org.apache.hudi.client.HoodieFlinkWriteClient.waitForCleaningFinish(HoodieFlinkWriteClient.java:344)
    at org.apache.hudi.sink.CleanFunction.lambda$notifyCheckpointComplete$1(CleanFunction.java:84)
    at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieRollbackException: Failed to rollback hdfs://ns1200/user/test/tmp.db/app_jdr_ads_dra_edm_user_behavior_content_hudi_a_d_d commits 20240811184332421
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1061)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1008)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:935)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:917)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollbackFailedWrites(BaseHoodieTableServiceClient.java:912)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.lambda$clean$1cda88ee$1(BaseHoodieTableServiceClient.java:739)
    at org.apache.hudi.common.util.CleanerUtils.rollbackFailedWrites(CleanerUtils.java:214)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:738)
    at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:843)
    at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:816)
    at org.apache.hudi.async.AsyncCleanerService.lambda$startService$0(AsyncCleanerService.java:55)
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
    ... 3 common frames omitted
Caused by: org.apache.hudi.exception.HoodieException: Error occurs when executing flatMap
    at org.apache.hudi.common.function.FunctionWrapper.lambda$throwingFlatMapWrapper$1(FunctionWrapper.java:50)
    at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
    at java.util.stream.AbstractTask.compute(AbstractTask.java:316)
    at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
    at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
    at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
    at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.apache.hudi.client.common.HoodieFlinkEngineContext.flatMap(HoodieFlinkEngineContext.java:141)
    at org.apache.hudi.table.action.rollback.BaseRollbackHelper.maybeDeleteAndCollectStats(BaseRollbackHelper.java:150)
    at org.apache.hudi.table.action.rollback.BaseRollbackHelper.performRollback(BaseRollbackHelper.java:115)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.executeRollback(BaseRollbackActionExecutor.java:245)
    at org.apache.hudi.table.action.rollback.MergeOnReadRollbackActionExecutor.executeRollback(MergeOnReadRollbackActionExecutor.java:87)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.doRollbackAndGetStats(BaseRollbackActionExecutor.java:227)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:111)
    at org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:141)
    at org.apache.hudi.table.HoodieFlinkMergeOnReadTable.rollback(HoodieFlinkMergeOnReadTable.java:158)
    at org.apache.hudi.client.BaseHoodieTableServiceClient.rollback(BaseHoodieTableServiceClient.java:1044)
    ... 14 common frames omitted
Caused by: org.apache.hudi.exception.HoodieException: Failed to create marker file hdfs://ns1007/user/test/tmp.db/app_jdr_ads_dra_edm_user_behavior_content_hudi_a_d_d/.hoodie/.temp/20240811185848523/dt=2024-08-11/.00000168-778b-477d-b4ab-1417e067f08e_20240811182559380.log.1_13-64-0.marker.APPEND
    at org.apache.hudi.table.marker.DirectWriteMarkers.create(DirectWriteMarkers.java:264)
    at org.apache.hudi.table.marker.DirectWriteMarkers.createWithEarlyConflictDetection(DirectWriteMarkers.java:243)
    at org.apache.hudi.table.marker.WriteMarkers.createIfNotExists(WriteMarkers.java:135)
    at org.apache.hudi.table.action.rollback.BaseRollbackHelper$1.createAppendMarker(BaseRollbackHelper.java:251)
    at org.apache.hudi.table.action.rollback.BaseRollbackHelper$1.preLogFileOpen(BaseRollbackHelper.java:241)
    at org.apache.hudi.common.table.log.HoodieLogFormatWriter.getOutputStream(HoodieLogFormatWriter.java:100)
    at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlocks(HoodieLogFormatWriter.java:149)
    at org.apache.hudi.common.table.log.HoodieLogFormatWriter.appendBlock(HoodieLogFormatWriter.java:140)
    at org.apache.hudi.table.action.rollback.BaseRollbackHelper.lambda$maybeDeleteAndCollectStats$b2977713$1(BaseRollbackHelper.java:181)
    at org.apache.hudi.common.function.FunctionWrapper.lambda$throwingFlatMapWrapper$1(FunctionWrapper.java:48)
    ... 38 common frames omitted
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/test/tmp.db/app_jdr_ads_dra_edm_user_behavior_content_hudi_a_d_d/.hoodie/.temp/20240811185848523/dt=2024-08-11/.00000168-778b-477d-b4ab-1417e067f08e_20240811182559380.log.1_13-64-0.marker.APPEND for client 10.198.21.35 already exists
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.startFile(FSDirWriteFileOp.java:463)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2874)
    at org.apache.hadoop.hdfs.server.namenode.JDFSNamesystem.access$401(JDFSNamesystem.java:177)
    at org.apache.hadoop.hdfs.server.namenode.JDFSNamesystem$5.call(JDFSNamesystem.java:1494)
    at org.apache.hadoop.hdfs.server.namenode.JDFSNamesystem$5.call(JDFSNamesystem.java:1484)
    at org.apache.hadoop.hdfs.server.namenode.JDFSNamesystem$CoalesceWriteThread.run(JDFSNamesystem.java:1647)

Environment Description

  • Hudi version : 0.10.0

  • Spark version : no

  • Hive version : no

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : hdfs

  • Running on Docker? (yes/no) :no

  • Flink version: 1.14

Additional context

Add any other context about the problem here.

Stacktrace

Add the stacktrace of the error.

LmrZER0 avatar Aug 13 '24 04:08 LmrZER0

Do you have multiple jobs writing to this table? With lazy cleaning, only one cleaner is allowed for now, because cleaning is not currently guarded by any lock. That means you can enable cleaning in only a single job.
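As a rough illustration of the single-cleaner constraint described above, the sketch below builds per-job option maps in which exactly one job keeps the cleaner on. The first two keys are taken from the reporter's config; `clean.async.enabled` is assumed here to be the Flink-side switch for the async cleaner and should be checked against the Hudi version in use. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiWriterCleanerConfig {
    // Assumption: Flink option key that toggles the async cleaner; verify for your Hudi version.
    public static final String CLEAN_ASYNC = "clean.async.enabled";

    /** Builds writer options for one job; only the designated job runs the cleaner. */
    public static Map<String, String> forJob(boolean runsCleaner) {
        Map<String, String> opts = new LinkedHashMap<>();
        // These two keys come from the config posted in this issue.
        opts.put("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
        opts.put("hoodie.cleaner.policy.failed.writes", "LAZY");
        opts.put(CLEAN_ASYNC, Boolean.toString(runsCleaner));
        return opts;
    }

    /** Counts how many jobs have the cleaner switched on. */
    public static long countCleaners(List<Map<String, String>> jobs) {
        return jobs.stream()
                .filter(o -> Boolean.parseBoolean(o.get(CLEAN_ASYNC)))
                .count();
    }

    public static void main(String[] args) {
        List<Map<String, String>> jobs = Arrays.asList(forJob(true), forJob(false));
        if (countCleaners(jobs) != 1) {
            throw new IllegalStateException("exactly one job should run the cleaner");
        }
        jobs.forEach(System.out::println);
    }
}
```

The check in `main` is only a guard for this sketch; in a real deployment the option maps would be passed to each Flink job's table options.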

danny0405 avatar Aug 14 '24 08:08 danny0405

@LmrZER0 Also, can you provide your full writer configurations?

ad1happy2go avatar Aug 14 '24 12:08 ad1happy2go

@LmrZER0 Will you be able to provide the required info so we can look into this further? Please let us know if it has been resolved.

ad1happy2go avatar Aug 22 '24 04:08 ad1happy2go

Do you have Spark speculation enabled, by any chance?

nsivabalan avatar Sep 13 '24 19:09 nsivabalan

Even if the marker already exists, the rollback could still proceed; this might be a possible improvement.
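The idea of tolerating an existing marker can be sketched as follows. This is not Hudi's `DirectWriteMarkers` code: it uses the local filesystem through `java.nio.file` instead of HDFS, and the class name, method name, and marker file name are hypothetical. It only illustrates treating "already exists" as a non-fatal outcome so a retried rollback can continue.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IdempotentMarker {
    /**
     * Creates the marker file if it is absent.
     * Returns true if this call created it, false if it was already there.
     */
    public static boolean createIfNotExists(Path marker) throws IOException {
        Path parent = marker.getParent();
        if (parent != null) {
            // No-op when the directory already exists.
            Files.createDirectories(parent);
        }
        try {
            Files.createFile(marker);
            return true;
        } catch (FileAlreadyExistsException e) {
            // An earlier rollback attempt may have left this marker behind;
            // treat it as already done instead of failing the rollback.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tempDir = Files.createTempDirectory("hoodie-temp");
        // Hypothetical marker name, shaped like the one in the stack trace.
        Path marker = tempDir.resolve("dt=2024-08-11").resolve("example.log.1_13-64-0.marker.APPEND");
        System.out.println("first attempt created marker: " + createIfNotExists(marker));
        System.out.println("second attempt created marker: " + createIfNotExists(marker));
    }
}
```

Whether this is safe in Hudi depends on who owns the existing marker; with early conflict detection enabled, an existing marker can also signal a concurrent writer, so the real fix would need to distinguish the two cases.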

danny0405 avatar Sep 14 '24 00:09 danny0405