[Bug]: Optimizing process stuck after a few failures
What happened?
If an optimizing task fails constantly (e.g. because of corrupted files), the optimizing task would be stuck at the RUNNING state after a few retries, while the task has already failed at the optimizer side.
Affects Versions
0.5.1
What engines are you seeing the problem on?
AMS, Optimizer
How to reproduce
- create a table with some data that meets the optimizing conditions.
- truncate a data file that would take part in the optimizing compaction.
- AMS tries to optimize the table but fails and retries.
- after a few retries, AMS would lose track of the task, and leave the task stuck at RUNNING state.
Relevant log output
No response
Anything else
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for feeding back, this seems a scheduing issue.
Did you truncate data files between 18:09 and 24:09? or it happened before all these failures in the picture. It would really help if you could describe more
If there are logs would help more
@majin1102 Thanks for your reply!
The table has been corrupted since the very beginning. The failed executions were for the same reason.
I'm digging into more details.
The AMS log shows that the completed task was not found. No leader election happened at that period. It looks like a bug in the task lifecycle.
2024-03-11 00:24:09,755 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:1), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134213316, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=31, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,756 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:2), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134095612, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=27, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,757 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:3), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134166571, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=21, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,758 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:4), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134215164, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=20, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,759 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:5), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134213859, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=17, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,760 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:6), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134129656, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=17, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,761 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:7), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134131430, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=24, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,762 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:8), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134144945, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=19, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,762 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:9), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=132981844, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=32, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,763 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:10), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=50579960, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=7, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 6) polled task OptimizingTaskId(processId:1710087849397, taskId:3)
2024-03-11 00:24:09,870 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:3) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 6)
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 3) polled task OptimizingTaskId(processId:1710087849397, taskId:5)
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 9) polled task OptimizingTaskId(processId:1710087849397, taskId:2)
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 4) polled task OptimizingTaskId(processId:1710087849397, taskId:4)
2024-03-11 00:24:09,879 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:5) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 3)
2024-03-11 00:24:09,879 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:2) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 9)
2024-03-11 00:24:09,880 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:4) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 4)
2024-03-11 00:24:10,525 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 5) polled task OptimizingTaskId(processId:1710087849397, taskId:7)
2024-03-11 00:24:10,529 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:7) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 5)
2024-03-11 00:24:10,831 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 1) polled task OptimizingTaskId(processId:1710087849397, taskId:6)
2024-03-11 00:24:10,836 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:6) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 1)
2024-03-11 00:24:11,006 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 0) polled task OptimizingTaskId(processId:1710087849397, taskId:9)
2024-03-11 00:24:11,011 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:9) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 0)
2024-03-11 00:24:11,997 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 2) polled task OptimizingTaskId(processId:1710087849397, taskId:8)
2024-03-11 00:24:12,002 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:8) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 2)
2024-03-11 00:24:12,670 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 7) polled task OptimizingTaskId(processId:1710087849397, taskId:10)
2024-03-11 00:24:12,689 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:10) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 7)
2024-03-11 00:24:13,766 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:13,770 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:17,065 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:17,122 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:17,126 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:20,337 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:20,393 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:20,398 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:21,517 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:10)
2024-03-11 00:24:23,596 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:23,653 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:23,657 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:26,908 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:26,966 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:26,970 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:30,496 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:30,554 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:30,558 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:33,765 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:33,822 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:33,825 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:34,586 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:9)
2024-03-11 00:24:34,822 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:5)
2024-03-11 00:24:34,835 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:7)
2024-03-11 00:24:35,017 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:6)
2024-03-11 00:24:37,232 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:37,473 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:4)
2024-03-11 00:24:37,589 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:3)
2024-03-11 00:24:37,590 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:3) not found.
at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202]
at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202]
at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2024-03-11 00:24:38,215 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:2)
2024-03-11 00:24:38,218 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:2) not found.
at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202]
at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202]
at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2024-03-11 00:24:41,672 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:8)2024-03-11 00:24:41,673 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:8) not found.
at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202]
at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202]
at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0]
at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
A solution would be to fail the current process if a TaskNotFoundException happens. Otherwise, the process would be stuck forever. We could file another issue to further dig into and resolve the unexcepted TaskNotFoundException. WDYT? @majin1102
Thanks so much for the further details. The log really narrows situations.
Like you said, apparently the task has lost when optimizer completes the task(the tasks not encounter corrupted file). The actual happening here is not process retry, instead every failure is a single planned process. So If the loss happens after constant times of failure, it doesn't make sense cause each process is independent.
I wonder if there are expiring optimizer events in the log, or any other optimizer related logs could trigger task clean operations. Please check and let me know, meanwhile I would try to reproduce it in my own evironment.
WDYT
If TaskNotFoundException occurs, that should mean the process lifecyle has been ended. That's the design and a incosistency problem apparently occurs here. I think we'd better figure out the incosistency (what and how) and determine the solution.
Anyway I am open to fire another issue to optimize or resolve this situaion if we figure out the problem.
Hey, I checked the stack and found that the code here has been deeply refactored.
I wonder if 0.5.1 has encountered similar inconsistency feed back? @zhoujinsong
The AMS log shows that the completed task was not found. No leader election happened at that period. It looks like a bug in the task lifecycle.
2024-03-11 00:24:09,755 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:1), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134213316, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=31, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0} 2024-03-11 00:24:09,756 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:2), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134095612, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=27, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0} 2024-03-11 00:24:09,757 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:3), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134166571, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=21, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0} 2024-03-11 00:24:09,758 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:4), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134215164, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=20, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0} 2024-03-11 00:24:09,759 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:5), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134213859, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=17, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0} 2024-03-11 00:24:09,760 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:6), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134129656, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=17, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0} 2024-03-11 00:24:09,761 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:7), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134131430, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=24, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0} 2024-03-11 00:24:09,762 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:8), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134144945, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=19, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0} 2024-03-11 00:24:09,762 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:9), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=132981844, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=32, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0} 2024-03-11 00:24:09,763 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:10), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=50579960, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=7, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0} 2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 6) polled task OptimizingTaskId(processId:1710087849397, taskId:3) 2024-03-11 00:24:09,870 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:3) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 6) 2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 3) polled task OptimizingTaskId(processId:1710087849397, taskId:5) 2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 9) polled task OptimizingTaskId(processId:1710087849397, taskId:2) 2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 4) polled task OptimizingTaskId(processId:1710087849397, taskId:4) 2024-03-11 00:24:09,879 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:5) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 3) 2024-03-11 00:24:09,879 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:2) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 9) 2024-03-11 00:24:09,880 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:4) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 4) 2024-03-11 00:24:10,525 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 5) polled task OptimizingTaskId(processId:1710087849397, taskId:7) 2024-03-11 00:24:10,529 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:7) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 5) 2024-03-11 00:24:10,831 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 1) polled task OptimizingTaskId(processId:1710087849397, taskId:6) 2024-03-11 00:24:10,836 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:6) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 1) 2024-03-11 00:24:11,006 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 0) polled task OptimizingTaskId(processId:1710087849397, taskId:9) 2024-03-11 00:24:11,011 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:9) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 0) 2024-03-11 00:24:11,997 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 2) polled task OptimizingTaskId(processId:1710087849397, taskId:8) 2024-03-11 00:24:12,002 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:8) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 2) 2024-03-11 00:24:12,670 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 7) polled task OptimizingTaskId(processId:1710087849397, taskId:10) 2024-03-11 00:24:12,689 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:10) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 7) 2024-03-11 00:24:13,766 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:13,770 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) 2024-03-11 00:24:17,065 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:17,122 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:17,126 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) 2024-03-11 00:24:20,337 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:20,393 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:20,398 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) 2024-03-11 00:24:21,517 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:10) 2024-03-11 00:24:23,596 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:23,653 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:23,657 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) 2024-03-11 00:24:26,908 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:26,966 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:26,970 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) 2024-03-11 00:24:30,496 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:30,554 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:30,558 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) 2024-03-11 00:24:33,765 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:33,822 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:33,825 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) 2024-03-11 00:24:34,586 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:9) 2024-03-11 00:24:34,822 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:5) 2024-03-11 00:24:34,835 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:7) 2024-03-11 00:24:35,017 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:6) 2024-03-11 00:24:37,232 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1) 2024-03-11 00:24:37,473 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:4) 2024-03-11 00:24:37,589 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:3) 2024-03-11 00:24:37,590 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:3) not found. at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202] at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202] at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?] at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?] at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202] 2024-03-11 00:24:38,215 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:2) 2024-03-11 00:24:38,218 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:2) not found. at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202] at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202] at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?] at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?] at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202] 2024-03-11 00:24:41,672 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:8)2024-03-11 00:24:41,673 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:8) not found. at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202] at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202] at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202] at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?] at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?] at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?] at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?] at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0] at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
Could you search two pattern logs:
LOG.info("Task {} is canceled, remove it from executing task map", task.getTaskId());
LOG.info("Task {} is suspending, since it's optimizer is expired, put it to retry queue, optimizer {}",
task.getTaskId(), task.getOptimizingThread());
@majin1102 The logs I provided were found by the process id, so sadly there are no logs suggesting the task is removed. I double-checked the logs as you said and there was an optimizer expiring at 5 am, which is not relevant.
I'm trying to reproduce it again in the testing environment to get more information. Please let me know if you have any ideas. Thanks! @majin1102 @zhoujinsong
I've probably found the root cause. If a task fails and exhausts the max retries (default to 5), AMS will try to cancel all tasks (which causes TaskNotFoundException if not succeeds) and fail the process.
https://github.com/NetEase/amoro/blob/0e7d291955147545cbf88200087bfe572d8bbb9c/ams/server/src/main/java/com/netease/arctic/server/optimizing/OptimizingQueue.java#L446-L450
However, in my case, the cancellation process didn't work somehow, leaving the unfinished tasks at ACKED status and the process at RUNNING status.
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+
| process_id | task_id | table_id | catalog_name | db_name | table_name | status | retry_num | start_time | end_time |
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+
| 1710087849397 | 1 | 53865906 | catalog | db | table | FAILED | 6 | 2024-03-11 00:24:34 | 2024-03-11 00:24:37 |
| 1710087849397 | 2 | 53865906 | catalog | db | table | ACKED | 0 | 2024-03-11 00:24:10 | NULL |
| 1710087849397 | 3 | 53865906 | catalog | db | table | ACKED | 0 | 2024-03-11 00:24:10 | NULL |
| 1710087849397 | 4 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:37 |
| 1710087849397 | 5 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:35 |
| 1710087849397 | 6 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 | 7 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 | 8 | 53865906 | catalog | db | table | ACKED | 0 | 2024-03-11 00:24:12 | NULL |
| 1710087849397 | 9 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 | 10 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:13 | 2024-03-11 00:24:22 |
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+
In contrast, the previous failed process looks like this:
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
| process_id | task_id | table_id | catalog_name | db_name | table_name | status | retry_num | start_time | end_time |
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
| 1710087488523 | 1 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 |
| 1710087488523 | 2 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:35 |
| 1710087488523 | 3 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:34 |
| 1710087488523 | 4 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 |
| 1710087488523 | 5 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:40 |
| 1710087488523 | 6 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:58 |
| 1710087488523 | 7 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:45 |
| 1710087488523 | 8 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:12 | 2024-03-11 00:18:38 |
| 1710087488523 | 9 | 53865906 | catalog | db | table | CANCELED | 6 | 2024-03-11 00:21:06 | 2024-03-11 00:21:19 |
| 1710087488523 | 10 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:13 | 2024-03-11 00:18:24 |
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
I've probably found the root cause. If a task fails and exhausts the max retries (default to 5), AMS will try to cancel all tasks (which causes TaskNotFoundException if not succeeds) and fail the process.
https://github.com/NetEase/amoro/blob/0e7d291955147545cbf88200087bfe572d8bbb9c/ams/server/src/main/java/com/netease/arctic/server/optimizing/OptimizingQueue.java#L446-L450
However, in my case, the cancellation process didn't work somehow, leaving the unfinished tasks at ACKED status and the process at RUNNING status.
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+ | process_id | task_id | table_id | catalog_name | db_name | table_name | status | retry_num | start_time | end_time | +---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+ | 1710087849397 | 1 | 53865906 | catalog | db | table | FAILED | 6 | 2024-03-11 00:24:34 | 2024-03-11 00:24:37 | | 1710087849397 | 2 | 53865906 | catalog | db | table | ACKED | 0 | 2024-03-11 00:24:10 | NULL | | 1710087849397 | 3 | 53865906 | catalog | db | table | ACKED | 0 | 2024-03-11 00:24:10 | NULL | | 1710087849397 | 4 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:37 | | 1710087849397 | 5 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:35 | | 1710087849397 | 6 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 | | 1710087849397 | 7 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 | | 1710087849397 | 8 | 53865906 | catalog | db | table | ACKED | 0 | 2024-03-11 00:24:12 | NULL | | 1710087849397 | 9 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 | | 1710087849397 | 10 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:13 | 2024-03-11 00:24:22 | +---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+In contrast, the previous failed process looks like this:
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+ | process_id | task_id | table_id | catalog_name | db_name | table_name | status | retry_num | start_time | end_time | +---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+ | 1710087488523 | 1 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 | | 1710087488523 | 2 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:35 | | 1710087488523 | 3 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:34 | | 1710087488523 | 4 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 | | 1710087488523 | 5 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:40 | | 1710087488523 | 6 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:58 | | 1710087488523 | 7 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:45 | | 1710087488523 | 8 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:12 | 2024-03-11 00:18:38 | | 1710087488523 | 9 | 53865906 | catalog | db | table | CANCELED | 6 | 2024-03-11 00:21:06 | 2024-03-11 00:21:19 | | 1710087488523 | 10 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:13 | 2024-03-11 00:18:24 | +---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
Hi, Link.
You really did a good job for reaching here. I've checked the code and found the underlying bug. The memory operation should follow persisting operation which is reversed in the code you posted
Bug I think you should be able to locate a log like 'LOG.error("accept result error:", e)' That will make sense finally and the issue should be very accidental when database encountered an error. Could you check the log again?
Thanks
@majin1102 Weird enough, I couldn't find the related logs you said.
I tried to reproduce it in the test environment, but AMS works as expected even after 30+ failed process. I assume it's related to the workload, as I have 100+ tables in the test environment but 10k in production.
FYI, I find our flink optimizers have a memory leak that's probably caused by glibc and get OOM killed sometimes. That may also contribute to the inconsistency problem.
I've probably found the root cause. If a task fails and exhausts the max retries (default to 5), AMS will try to cancel all tasks (which causes TaskNotFoundException if not succeeds) and fail the process.
https://github.com/NetEase/amoro/blob/0e7d291955147545cbf88200087bfe572d8bbb9c/ams/server/src/main/java/com/netease/arctic/server/optimizing/OptimizingQueue.java#L446-L450
However, in my case, the cancellation process didn't work somehow, leaving the unfinished tasks at ACKED status and the process at RUNNING status.
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+ | process_id | task_id | table_id | catalog_name | db_name | table_name | status | retry_num | start_time | end_time | +---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+ | 1710087849397 | 1 | 53865906 | catalog | db | table | FAILED | 6 | 2024-03-11 00:24:34 | 2024-03-11 00:24:37 | | 1710087849397 | 2 | 53865906 | catalog | db | table | ACKED | 0 | 2024-03-11 00:24:10 | NULL | | 1710087849397 | 3 | 53865906 | catalog | db | table | ACKED | 0 | 2024-03-11 00:24:10 | NULL | | 1710087849397 | 4 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:37 | | 1710087849397 | 5 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:35 | | 1710087849397 | 6 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 | | 1710087849397 | 7 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 | | 1710087849397 | 8 | 53865906 | catalog | db | table | ACKED | 0 | 2024-03-11 00:24:12 | NULL | | 1710087849397 | 9 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 | | 1710087849397 | 10 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:24:13 | 2024-03-11 00:24:22 | +---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+In contrast, the previous failed process looks like this:
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+ | process_id | task_id | table_id | catalog_name | db_name | table_name | status | retry_num | start_time | end_time | +---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+ | 1710087488523 | 1 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 | | 1710087488523 | 2 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:35 | | 1710087488523 | 3 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:34 | | 1710087488523 | 4 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 | | 1710087488523 | 5 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:40 | | 1710087488523 | 6 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:58 | | 1710087488523 | 7 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:45 | | 1710087488523 | 8 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:12 | 2024-03-11 00:18:38 | | 1710087488523 | 9 | 53865906 | catalog | db | table | CANCELED | 6 | 2024-03-11 00:21:06 | 2024-03-11 00:21:19 | | 1710087488523 | 10 | 53865906 | catalog | db | table | SUCCESS | 0 | 2024-03-11 00:18:13 | 2024-03-11 00:18:24 | +---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
However, in my case, the cancellation process didn't work somehow, leaving the unfinished tasks at ACKED status and the process at RUNNING status.
You can check if there is a deadlock in jstack. I have also encountered the same problem.
For me, this is why the state is not being updated.
"thrift-server-optimize-manager-20575" Id=646912 WAITING on java.util.concurrent.locks.ReentrantLock$NonfairSync@68e7d329 owned by "thrift-server-optimize-manager-20440" Id=641953
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@68e7d329
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at com.netease.arctic.server.persistence.StatedPersistentBase.invokeConsisitency(StatedPersistentBase.java:49)
at com.netease.arctic.server.optimizing.TaskRuntime.tryCanceling(TaskRuntime.java:146)
at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$1010/1409555188.accept(Unknown Source)
at java.util.HashMap$Values.forEach(HashMap.java:981)
at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.lambda$persistProcessCompleted$10(OptimizingQueue.java:604)
at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$1007/808873318.run(Unknown Source)
at com.netease.arctic.server.persistence.PersistentBase$$Lambda$278/886902042.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at com.netease.arctic.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:61)
at com.netease.arctic.server.optimizing.OptimizingQueue.access$400(OptimizingQueue.java:71)
at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.persistProcessCompleted(OptimizingQueue.java:603)
at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:437)
at com.netease.arctic.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:103)
at com.netease.arctic.server.optimizing.TaskRuntime$$Lambda$274/802047498.run(Unknown Source)
at com.netease.arctic.server.persistence.PersistentBase$$Lambda$278/886902042.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at com.netease.arctic.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:61)
at com.netease.arctic.server.persistence.StatedPersistentBase.invokeConsisitency(StatedPersistentBase.java:52)
at com.netease.arctic.server.optimizing.TaskRuntime.complete(TaskRuntime.java:76)
at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:230)
at sun.reflect.GeneratedMethodAccessor225.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
at com.sun.proxy.$Proxy48.completeTask(Unknown Source)
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583)
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138)
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524)
at org.apache.thrift.server.Invocation.run(Invocation.java:18)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
"thrift-server-optimize-manager-20440" Id=641953 WAITING on java.util.concurrent.locks.ReentrantLock$NonfairSync@77281271 owned by "thrift-server-optimize-manager-20575" Id=646912
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@77281271
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:407)
at com.netease.arctic.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:103)
at com.netease.arctic.server.optimizing.TaskRuntime$$Lambda$274/802047498.run(Unknown Source)
at com.netease.arctic.server.persistence.PersistentBase$$Lambda$278/886902042.accept(Unknown Source)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at com.netease.arctic.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:61)
at com.netease.arctic.server.persistence.StatedPersistentBase.invokeConsisitency(StatedPersistentBase.java:52)
at com.netease.arctic.server.optimizing.TaskRuntime.complete(TaskRuntime.java:76)
at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:230)
at sun.reflect.GeneratedMethodAccessor225.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
at com.sun.proxy.$Proxy48.completeTask(Unknown Source)
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583)
at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138)
at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524)
at org.apache.thrift.server.Invocation.run(Invocation.java:18)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
@rfyu Thanks a lot for your pointer!
"thrift-server-optimize-manager-20575" Id=646912 WAITING o
This is really a big problem which has never been considered.
Do you have any ideas to solve this?
We could add a timeout for task cancelation. When it times out, the thread accepting the current result should release the task runtime lock and skip the cancelation, to clear the path for the other threads to accept a newer result.
Can we check the current status of a task before canceling it, and only cancel tasks that can be canceled, instead of canceling all tasks?
@rfyu I'm afraid it can't completely solve the problem, as the task condition can change after the check and before the cancelation happens.
Please take a look at https://github.com/NetEase/amoro/pull/2644 @majin1102 @rfyu
@rfyu I'm afraid it can't completely solve the problem, as the task condition can change after the check and before the cancelation happens.
You are right, I didn't consider it carefully
I found a related issue with this: #2380.
We may need to resolve it ASAP.
We may need to resolve it ASAP.
Agreed. Please help review https://github.com/NetEase/amoro/pull/2644 when available.
We could add a timeout for task cancelation. When it times out, the thread accepting the current result should release the task runtime lock and skip the cancelation, to clear the path for the other threads to accept a newer result.
Thanks for your idea.
In my opinion, this is a little bit complex. I suggest considering to remove lock from TaskRuntime cause it already has a CAS lock(status.accept(xxx)). How about take a try?
@link3280 Thank you very much for continuously driving the resolution of this issue. I have read the codes in #2644. I understand your core idea is to downgrade the cancel operation for all tasks, allowing them to fail to avoid deadlock. I believe this idea indeed addresses the problem, but it does introduce complexity to the code and may not be as straightforward to understand.
So I tried to find more solutions, and now I have a new approach: The task completion operation is very special. It not only changes the state of the task but also triggers the checking and changing of the process state. Therefore, it first acquires its own persistent lock before acquiring the process lock. However, if we change the order of lock acquisition, it seems we can effectively avoid deadlock. Specifically, acquiring the process lock first, then acquiring the task persistent lock.
Translating this into the current codebase, we can directly call TableOptimizingProcess#acceptResult method in the TaskRuntime#complete function. Inside it, we can first acquire the process lock, then proceed with the checking and state changes for both task and process.
HDYT? @link3280 @majin1102 @rfyu
I suggest considering to remove lock from TaskRuntime cause it already has a CAS lock(status.accept(xxx)). How about take a try?
The task runtime persistent lock is managed by the persistent framework. I think it is not easy to be removed.
We could add a timeout for task cancelation. When it times out, the thread accepting the current result should release the task runtime lock and skip the cancelation, to clear the path for the other threads to accept a newer result.
Thanks for your idea.
In my opinion, this is a little bit complex. I suggest considering to remove lock from TaskRuntime cause it already has a CAS lock(status.accept(xxx)). How about take a try?
I'd like to take the CAS approach if it's a project from scratch, but I'm afraid it would break the current persistent layer as @zhoujinsong says. Nonetheless, I would have a try.
@link3280 Thank you very much for continuously driving the resolution of this issue. I have read the codes in #2644. I understand your core idea is to downgrade the cancel operation for all tasks, allowing them to fail to avoid deadlock. I believe this idea indeed addresses the problem, but it does introduce complexity to the code and may not be as straightforward to understand.
So I tried to find more solutions, and now I have a new approach: The task completion operation is very special. It not only changes the state of the task but also triggers the checking and changing of the process state. Therefore, it first acquires its own persistent lock before acquiring the process lock. However, if we change the order of lock acquisition, it seems we can effectively avoid deadlock. Specifically, acquiring the process lock first, then acquiring the task persistent lock.
Translating this into the current codebase, we can directly call
TableOptimizingProcess#acceptResultmethod in theTaskRuntime#completefunction. Inside it, we can first acquire the process lock, then proceed with the checking and state changes for both task and process.HDYT? @link3280 @majin1102 @rfyu
@zhoujinsong I've thought about changing the lock acquisition order. The downside is that it causes a bigger footprint on the locks and harms the performance.
Normally, at least 99.9% of the tasks (either in a successful or failed process) would succeed and commit their results. Currently, they're doing it independently and no locks are shared. If they need to acquire a lock on the process, then the blocking commit time would be longer, especially for large tables with lots of optimizers. It may not be worth sacrificing the performance of general scenarios to ensure a cleanup operation that is a secondary priority, as it could be aborted after a timeout.
WDYT?
@link3280 Thank you very much for continuously driving the resolution of this issue. I have read the codes in #2644. I understand your core idea is to downgrade the cancel operation for all tasks, allowing them to fail to avoid deadlock. I believe this idea indeed addresses the problem, but it does introduce complexity to the code and may not be as straightforward to understand. So I tried to find more solutions, and now I have a new approach: The task completion operation is very special. It not only changes the state of the task but also triggers the checking and changing of the process state. Therefore, it first acquires its own persistent lock before acquiring the process lock. However, if we change the order of lock acquisition, it seems we can effectively avoid deadlock. Specifically, acquiring the process lock first, then acquiring the task persistent lock. Translating this into the current codebase, we can directly call
TableOptimizingProcess#acceptResultmethod in theTaskRuntime#completefunction. Inside it, we can first acquire the process lock, then proceed with the checking and state changes for both task and process. HDYT? @link3280 @majin1102 @rfyu@zhoujinsong I've thought about changing the lock acquisition order. The downside is that it causes a bigger footprint on the locks and harms the performance.
Normally, at least 99.9% of the tasks (either in a successful or failed process) would succeed and commit their results. Currently, they're doing it independently and no locks are shared. If they need to acquire a lock on the process, then the blocking commit time would be longer, especially for large tables with lots of optimizers. WDYT?
I think you are right about this.
Let me talk more about my thought: I propose creating a CAS state lock (based on thread ID) version of invokeConsistency() for TaskRuntime. When the CAS operation returns false, we can throw a concurrency exception. This approach adheres to a non-lock design, assuming that concurrency should not occur here or is considered illegal. Given that TaskRuntime already has a CAS lock, this solution seems quite reasonable.
In any case, let’s fix this in a proper manner.