[Bug]: Optimization task is always in the SCHEDULED state
What happened?
Tasks 6, 7, and 8 remained in the scheduled state until I canceled.
Affects Versions
master
What engines are you seeing the problem on?
AMS
How to reproduce
This problem is indeed difficult to reproduce, but the logs can explain everything.
Relevant log output
2024-03-06 04:21:34,849 INFO [plan-executor-thread-185] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - hms_ctripdi.dwice_ctripdi.ods_fuxi_ubt_action_data_di(tableId=8781678) plan new task OptimizingTaskId(processId:1709670094130, taskId:6), summary MetricsSummary{rewriteDataSize=127030991, rewriteDataFileCnt=11, rewriteDataRecordCnt=309709, rewritePosDataSize=0, rewritePosDataFileCnt=0, reRowDeletedDataFileCnt=0, rewritePosDataRecordCnt=0, equalityDeleteSize=0, eqDeleteFileCnt=0, eqDeleteRecordCnt=0, positionDeleteSize=0, positionalDeleteSize=0, posDeleteFileCnt=0, posDeleteRecordCnt=0, newFileSize=0, newFileCnt=0, newDataSize=0, newDataFileCnt=0, newDataRecordCnt=0, newDeleteSize=0, newDeleteFileCnt=0, newDeleteRecordCnt=0}
thread-14 拉取任务
2024-03-06 04:21:34,873 INFO [thrift-server-optimize-manager-1955] [com.netease.arctic.server.DefaultOptimizingService] [] - OptimizerThread
OptimizerThread{threadId=14, optimizer=OptimizerInstance{token=33afe56d-10bb-4dc5-9c9b-b57b763f982f, startTime=1709208531000, touchTime=1709670088755}} polled task OptimizingTaskId(processId:1709670094130, taskId:6)
thread-14 ACK任务
2024-03-06 04:21:34,884 INFO [thrift-server-optimize-manager-1984] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1709670094130, taskId:6) by optimizer 33afe56d-10bb-4dc5-9c9b-b57b763f982f (threadId 14)
任务过期 重新放回队列
2024-03-06 04:23:21,956 INFO [optimizer-keeper-thread] [com.netease.arctic.server.DefaultOptimizingService] [] - Task OptimizingTaskId(processId:1709670094130, taskId:6) is suspending, since it's optimizer is expired, put it to retry queue, optimizer 33afe56d-10bb-4dc5-9c9b-b57b763f982f:14
注销optimizer
2024-03-06 04:23:22,017 INFO [optimizer-keeper-thread] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer OptimizerInstance{token=33afe56d-10bb-4dc5-9c9b-b57b763f982f, startTime=1709208531000, touchTime=1709670108771} has been expired, unregister it
thread-1 拉取任务(新的 threadId-1 和老的 token 入库)
2024-03-06 04:23:22,018 INFO [thrift-server-optimize-manager-1993] [com.netease.arctic.server.DefaultOptimizingService] [] - OptimizerThread OptimizerThread{threadId=1, optimizer=OptimizerInstance{token=33afe56d-10bb-4dc5-9c9b-b57b763f982f, startTime=1709208531000, touchTime=1709670108771}} polled task OptimizingTaskId(processId:1709670094130, taskId:6)
thread-1 ACK任务
2024-03-06 04:23:22,028 INFO [thrift-server-optimize-manager-1993] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1709670094130, taskId:6) by optimizer 33afe56d-10bb-4dc5-9c9b-b57b763f982f (threadId 1)
thread-1 ACK任务失败
2024-03-06 04:23:22,034 ERROR [thrift-server-optimize-manager-1993] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.ackTask execute failed
com.netease.arctic.server.exception.PluginRetryAuthException: Optimizer has not been authenticated
at com.netease.arctic.server.DefaultOptimizingService.lambda$getQueueByToken$12(DefaultOptimizingService.java:271) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_291]
at com.netease.arctic.server.DefaultOptimizingService.getQueueByToken(DefaultOptimizingService.java:271) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.DefaultOptimizingService.ackTask(DefaultOptimizingService.java:212) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at sun.reflect.GeneratedMethodAccessor144.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_291]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_291]
at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56) [amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.sun.proxy.$Proxy49.ackTask(Unknown Source) [?:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$ackTask.getResult(OptimizingService.java:554) [amoro-ams-api-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$ackTask.getResult(OptimizingService.java:533) [amoro-ams-api-0.7.0-SNAPSHOT.jar:?]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.Invocation.run(Invocation.java:18) [libthrift-0.13.0.jar:0.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
注册新optimizer
2024-03-06 04:24:18,954 INFO [thrift-server-optimize-manager-1973] [com.netease.arctic.server.DefaultOptimizingService] [] - Register optimizer OptimizerRegisterInfo(threadCount:6, memoryMb:20480, startTime:1708588406699, groupName:bicorp-default, properties:{job-id=2b0c8990a8c7e82a3244576956829c93, heart-beat-interval=10000}).
thread-14 完成任务
2024-03-06 04:24:20,921 INFO [thrift-server-optimize-manager-2042] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 33afe56d-10bb-4dc5-9c9b-b57b763f982f complete task OptimizingTaskId(processId:1709670094130, taskId:6)
thread-14 完成失败
2024-03-06 04:24:20,921 ERROR [thrift-server-optimize-manager-2042] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.PluginRetryAuthException: Optimizer has not been authenticated
at com.netease.arctic.server.DefaultOptimizingService.lambda$getQueueByToken$12(DefaultOptimizingService.java:271) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_291]
at com.netease.arctic.server.DefaultOptimizingService.getQueueByToken(DefaultOptimizingService.java:271) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:221) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at sun.reflect.GeneratedMethodAccessor233.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_291]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_291]
at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56) [amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.sun.proxy.$Proxy49.completeTask(Unknown Source) [?:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) [amoro-ams-api-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) [amoro-ams-api-0.7.0-SNAPSHOT.jar:?]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.Invocation.run(Invocation.java:18) [libthrift-0.13.0.jar:0.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
thread-14 再次完成任务
2024-03-06 04:24:20,931 INFO [thrift-server-optimize-manager-2025] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 8eb14549-b3b6-41a4-a17d-3b91c3df5d38 complete task OptimizingTaskId(processId:1709670094130, taskId:6)
thread-14 再次完成失败
2024-03-06 04:24:20,931 ERROR [thrift-server-optimize-manager-2025] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.DuplicateRuntimeException: Task already acked by optimizer thread + OptimizerThread{threadId=14, optimizer=OptimizerInstance{token=8eb14549-b3b6-41a4-a17d-3b91c3df5d38, startTime=1709208531459, touchTime=1709670259441}}
at com.netease.arctic.server.optimizing.TaskRuntime.validThread(TaskRuntime.java:305) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:78) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_291]
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_291]
at com.netease.arctic.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:61) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.persistence.StatedPersistentBase.invokeConsisitency(StatedPersistentBase.java:52) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.optimizing.TaskRuntime.complete(TaskRuntime.java:76) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:226) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at sun.reflect.GeneratedMethodAccessor233.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_291]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_291]
at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56) [amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.sun.proxy.$Proxy49.completeTask(Unknown Source) [?:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) [amoro-ams-api-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) [amoro-ams-api-0.7.0-SNAPSHOT.jar:?]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.Invocation.run(Invocation.java:18) [libthrift-0.13.0.jar:0.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
thread-1 ACK 任务
2024-03-06 04:24:22,228 INFO [thrift-server-optimize-manager-2076] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1709670094130, taskId:6) by optimizer 8eb14549-b3b6-41a4-a17d-3b91c3df5d38 (threadId 1)
thread-1 ACK 失败
2024-03-06 04:24:22,228 ERROR [thrift-server-optimize-manager-2076] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.ackTask execute failed
com.netease.arctic.server.exception.DuplicateRuntimeException: Task already acked by optimizer thread + OptimizerThread{threadId=1, optimizer=OptimizerInstance{token=8eb14549-b3b6-41a4-a17d-3b91c3df5d38, startTime=1709208531459, touchTime=1709670259441}}
at com.netease.arctic.server.optimizing.TaskRuntime.validThread(TaskRuntime.java:305) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.optimizing.TaskRuntime.lambda$ack$3(TaskRuntime.java:139) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_291]
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_291]
at com.netease.arctic.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:61) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.persistence.StatedPersistentBase.invokeConsisitency(StatedPersistentBase.java:52) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.optimizing.TaskRuntime.ack(TaskRuntime.java:137) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.DefaultOptimizingService.ackTask(DefaultOptimizingService.java:215) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at sun.reflect.GeneratedMethodAccessor144.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_291]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_291]
at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56) [amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.sun.proxy.$Proxy49.ackTask(Unknown Source) [?:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$ackTask.getResult(OptimizingService.java:554) [amoro-ams-api-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$ackTask.getResult(OptimizingService.java:533) [amoro-ams-api-0.7.0-SNAPSHOT.jar:?]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.Invocation.run(Invocation.java:18) [libthrift-0.13.0.jar:0.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
Anything else
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hey, thanks for feeding back this issue.
Can you explain more about this log? What really happed here: thread-1 拉取任务(新的 threadId-1 和老的 token 入库)
I don't get it.
Hey, thanks for feeding back this issue.
Can you explain more about this log? What really happed here: thread-1 拉取任务(新的 threadId-1 和老的 token 入库)
I don't get it.
Although unregisterOptimizer appears in the logs before polledTask, in reality, polledTask maybe occur between LOG.info("Optimizer {} has been expired, unregister it", keepingTask.getOptimizer()) and unregisterOptimizer(token). In this scenario, polledTask will use the new threadId and expired token.
The real timeline may be like this
2024-03-06 04:23:21,956 retryTask
2024-03-06 04:23:22,017 LOG.info("Optimizer {} has been expired, unregister it", keepingTask.getOptimizer());
2024-03-06 04:23:22,018 polledTask(new thread and expired token)
then unregisterOptimizer
2024-03-06 04:23:22,018 ackTask
2024-03-06 04:23:22,034 ack failed
If we can execute unregisterOptimizer before retryTask, we can ensure that optimizer using expired tokens cannot fetch new tasks.
I see a completeTask call after ack failed.
Is this possible?
I mean pollTask could occur any time, I beleve what you've done has fixed this issue. However I think the logic here is still confusing
thread-1 ACK 失败
2024-03-06 04:24:22,228 ERROR [thrift-server-optimize-manager-2076] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.ackTask execute failed
com.netease.arctic.server.exception.DuplicateRuntimeException: Task already acked by optimizer thread + OptimizerThread{threadId=1, optimizer=OptimizerInstance{token=8eb14549-b3b6-41a4-a17d-3b91c3df5d38, startTime=1709208531459, touchTime=1709670259441}}
at com.netease.arctic.server.optimizing.TaskRuntime.validThread(TaskRuntime.java:305) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.optimizing.TaskRuntime.lambda$ack$3(TaskRuntime.java:139) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_291]
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[?:1.8.0_291]
at com.netease.arctic.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:61) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.persistence.StatedPersistentBase.invokeConsisitency(StatedPersistentBase.java:52) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.optimizing.TaskRuntime.ack(TaskRuntime.java:137) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.server.DefaultOptimizingService.ackTask(DefaultOptimizingService.java:215) ~[amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at sun.reflect.GeneratedMethodAccessor144.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_291]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_291]
at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56) [amoro-ams-server-0.7.0-SNAPSHOT.jar:?]
at com.sun.proxy.$Proxy49.ackTask(Unknown Source) [?:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$ackTask.getResult(OptimizingService.java:554) [amoro-ams-api-0.7.0-SNAPSHOT.jar:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$ackTask.getResult(OptimizingService.java:533) [amoro-ams-api-0.7.0-SNAPSHOT.jar:?]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.Invocation.run(Invocation.java:18) [libthrift-0.13.0.jar:0.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_291]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_291]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_291]
And here return failed when the same thread ack again, this is strange!