amoro icon indicating copy to clipboard operation
amoro copied to clipboard

[Bug]: Optimizing process stuck after a few failures

Open link3280 opened this issue 1 year ago • 31 comments

What happened?

If an optimizing task fails constantly (e.g. because of corrupted files), the optimizing task would be stuck at the RUNNING state after a few retries, while the task has already failed at the optimizer side.

截屏2024-03-11 16 55 27

Affects Versions

0.5.1

What engines are you seeing the problem on?

AMS, Optimizer

How to reproduce

  1. create a table with some data that meets the optimizing conditions.
  2. truncate a data file that would take part in the optimizing compaction.
  3. AMS tries to optimize the table but fails and retries.
  4. after a few retries, AMS would lose track of the task, and leave the task stuck at RUNNING state.

Relevant log output

No response

Anything else

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

link3280 avatar Mar 11 '24 09:03 link3280

Thanks for feeding back, this seems a scheduing issue.

Did you truncate data files between 18:09 and 24:09? or it happened before all these failures in the picture. It would really help if you could describe more

If there are logs would help more

majin1102 avatar Mar 12 '24 11:03 majin1102

@majin1102 Thanks for your reply!

The table has been corrupted since the very beginning. The failed executions were for the same reason.

I'm digging into more details.

link3280 avatar Mar 13 '24 02:03 link3280

The AMS log shows that the completed task was not found. No leader election happened at that period. It looks like a bug in the task lifecycle.

2024-03-11 00:24:09,755 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:1), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134213316, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=31, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,756 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:2), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134095612, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=27, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,757 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:3), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134166571, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=21, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,758 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:4), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134215164, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=20, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,759 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:5), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134213859, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=17, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,760 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:6), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134129656, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=17, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,761 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:7), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134131430, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=24, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,762 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:8), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134144945, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=19, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,762 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:9), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=132981844, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=32, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,763 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:10), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=50579960, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=7, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 6) polled task OptimizingTaskId(processId:1710087849397, taskId:3)
2024-03-11 00:24:09,870 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:3) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 6)
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 3) polled task OptimizingTaskId(processId:1710087849397, taskId:5)
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 9) polled task OptimizingTaskId(processId:1710087849397, taskId:2)
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 4) polled task OptimizingTaskId(processId:1710087849397, taskId:4)
2024-03-11 00:24:09,879 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:5) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 3)
2024-03-11 00:24:09,879 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:2) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 9)
2024-03-11 00:24:09,880 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:4) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 4)
2024-03-11 00:24:10,525 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 5) polled task OptimizingTaskId(processId:1710087849397, taskId:7)
2024-03-11 00:24:10,529 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:7) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 5)
2024-03-11 00:24:10,831 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 1) polled task OptimizingTaskId(processId:1710087849397, taskId:6)
2024-03-11 00:24:10,836 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:6) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 1)
2024-03-11 00:24:11,006 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 0) polled task OptimizingTaskId(processId:1710087849397, taskId:9)
2024-03-11 00:24:11,011 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:9) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 0)
2024-03-11 00:24:11,997 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 2) polled task OptimizingTaskId(processId:1710087849397, taskId:8)
2024-03-11 00:24:12,002 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:8) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 2)
2024-03-11 00:24:12,670 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 7) polled task OptimizingTaskId(processId:1710087849397, taskId:10)
2024-03-11 00:24:12,689 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:10) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 7)
2024-03-11 00:24:13,766 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:13,770 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:17,065 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:17,122 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:17,126 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:20,337 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:20,393 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:20,398 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:21,517 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:10)
2024-03-11 00:24:23,596 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:23,653 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:23,657 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:26,908 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:26,966 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:26,970 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:30,496 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:30,554 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:30,558 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:33,765 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:33,822 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:33,825 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:34,586 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:9)
2024-03-11 00:24:34,822 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:5)
2024-03-11 00:24:34,835 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:7)
2024-03-11 00:24:35,017 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:6)
2024-03-11 00:24:37,232 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:37,473 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:4)
2024-03-11 00:24:37,589 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:3)
2024-03-11 00:24:37,590 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:3) not found.
    at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202]
    at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202]
    at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2024-03-11 00:24:38,215 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:2)
2024-03-11 00:24:38,218 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:2) not found.
    at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202]
    at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202]
    at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2024-03-11 00:24:41,672 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:8)2024-03-11 00:24:41,673 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:8) not found.
    at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202]
    at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202]
    at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]

link3280 avatar Mar 13 '24 02:03 link3280

A solution would be to fail the current process if a TaskNotFoundException happens. Otherwise, the process would be stuck forever. We could file another issue to further dig into and resolve the unexcepted TaskNotFoundException. WDYT? @majin1102

link3280 avatar Mar 13 '24 04:03 link3280

Thanks so much for the further details. The log really narrows situations.

Like you said, apparently the task has lost when optimizer completes the task(the tasks not encounter corrupted file). The actual happening here is not process retry, instead every failure is a single planned process. So If the loss happens after constant times of failure, it doesn't make sense cause each process is independent.

I wonder if there are expiring optimizer events in the log, or any other optimizer related logs could trigger task clean operations. Please check and let me know, meanwhile I would try to reproduce it in my own evironment.

majin1102 avatar Mar 13 '24 15:03 majin1102

WDYT

If TaskNotFoundException occurs, that should mean the process lifecyle has been ended. That's the design and a incosistency problem apparently occurs here. I think we'd better figure out the incosistency (what and how) and determine the solution.

Anyway I am open to fire another issue to optimize or resolve this situaion if we figure out the problem.

majin1102 avatar Mar 13 '24 15:03 majin1102

Hey, I checked the stack and found that the code here has been deeply refactored.

I wonder if 0.5.1 has encountered similar inconsistency feed back? @zhoujinsong

majin1102 avatar Mar 13 '24 15:03 majin1102

The AMS log shows that the completed task was not found. No leader election happened at that period. It looks like a bug in the task lifecycle.

2024-03-11 00:24:09,755 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:1), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134213316, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=31, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,756 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:2), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134095612, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=27, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,757 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:3), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134166571, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=21, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,758 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:4), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134215164, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=20, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,759 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:5), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134213859, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=17, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,760 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:6), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134129656, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=17, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,761 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:7), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134131430, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=24, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,762 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:8), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=134144945, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=19, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,762 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:9), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=132981844, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=32, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,763 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.optimizing.OptimizingQueue] [] - catalog.db.table(tableId=53865906) plan new task OptimizingTaskId(processId:1710087849397, taskId:10), summary MetricsSummary{newFileSize=0, newFileCnt=0, rewriteDataSize=50579960, rewritePosDataSize=0, equalityDeleteSize=0, positionalDeleteSize=0, rewriteDataFileCnt=7, reRowDeletedDataFileCnt=0, eqDeleteFileCnt=0, posDeleteFileCnt=0}
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 6) polled task OptimizingTaskId(processId:1710087849397, taskId:3)
2024-03-11 00:24:09,870 INFO [thrift-server-OptimizeManager-4849] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:3) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 6)
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 3) polled task OptimizingTaskId(processId:1710087849397, taskId:5)
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 9) polled task OptimizingTaskId(processId:1710087849397, taskId:2)
2024-03-11 00:24:09,866 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 4) polled task OptimizingTaskId(processId:1710087849397, taskId:4)
2024-03-11 00:24:09,879 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:5) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 3)
2024-03-11 00:24:09,879 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:2) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 9)
2024-03-11 00:24:09,880 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:4) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 4)
2024-03-11 00:24:10,525 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 5) polled task OptimizingTaskId(processId:1710087849397, taskId:7)
2024-03-11 00:24:10,529 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:7) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 5)
2024-03-11 00:24:10,831 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 1) polled task OptimizingTaskId(processId:1710087849397, taskId:6)
2024-03-11 00:24:10,836 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:6) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 1)
2024-03-11 00:24:11,006 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 0) polled task OptimizingTaskId(processId:1710087849397, taskId:9)
2024-03-11 00:24:11,011 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:9) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 0)
2024-03-11 00:24:11,997 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 2) polled task OptimizingTaskId(processId:1710087849397, taskId:8)
2024-03-11 00:24:12,002 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:8) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 2)
2024-03-11 00:24:12,670 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 7) polled task OptimizingTaskId(processId:1710087849397, taskId:10)
2024-03-11 00:24:12,689 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:10) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 7)
2024-03-11 00:24:13,766 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:13,770 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:17,065 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:17,122 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:17,126 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:20,337 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:20,393 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:20,398 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:21,517 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:10)
2024-03-11 00:24:23,596 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:23,653 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:23,657 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:26,908 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:26,966 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:26,970 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:30,496 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:30,554 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:30,558 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:33,765 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:33,822 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8) polled task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:33,825 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Ack task OptimizingTaskId(processId:1710087849397, taskId:1) by optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d (threadId 8)
2024-03-11 00:24:34,586 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:9)
2024-03-11 00:24:34,822 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:5)
2024-03-11 00:24:34,835 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:7)
2024-03-11 00:24:35,017 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:6)
2024-03-11 00:24:37,232 INFO [thrift-server-OptimizeManager-4850] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:1)
2024-03-11 00:24:37,473 INFO [thrift-server-OptimizeManager-4851] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:4)
2024-03-11 00:24:37,589 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:3)
2024-03-11 00:24:37,590 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:3) not found.
    at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202]
    at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202]
    at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2024-03-11 00:24:38,215 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:2)
2024-03-11 00:24:38,218 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:2) not found.
    at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202]
    at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202]
    at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]
2024-03-11 00:24:41,672 INFO [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.DefaultOptimizingService] [] - Optimizer 6a067ba7-542d-4a75-a5d4-36c7ea375e8d complete task OptimizingTaskId(processId:1710087849397, taskId:8)2024-03-11 00:24:41,673 ERROR [thrift-server-OptimizeManager-4852] [com.netease.arctic.server.TableManagementService] [] - Thrift service:DefaultOptimizingService.completeTask execute failed
com.netease.arctic.server.exception.TaskNotFoundException: Task OptimizingTaskId(processId:1710087849397, taskId:8) not found.
    at com.netease.arctic.server.optimizing.OptimizingQueue.lambda$completeTask$11(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at java.util.Optional.orElseThrow(Optional.java:290) ~[?:1.8.0_202]
    at com.netease.arctic.server.optimizing.OptimizingQueue.completeTask(OptimizingQueue.java:212) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:154) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at sun.reflect.GeneratedMethodAccessor284.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_202]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202]
    at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:53) ~[amoro-ams-server-0.5.2-GDC1.2.2.jar:?]
    at com.sun.proxy.$Proxy52.completeTask(Unknown Source) ~[?:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562) ~[amoro-ams-api-0.5.2-GDC1.2.1.jar:?]
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) ~[libthrift-0.13.0.jar:0.13.0]
    at org.apache.thrift.server.Invocation.run(Invocation.java:18) ~[libthrift-0.13.0.jar:0.13.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_202]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_202]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_202]

image

Could you search two pattern logs:

LOG.info("Task {} is canceled, remove it from executing task map", task.getTaskId());

LOG.info("Task {} is suspending, since it's optimizer is expired, put it to retry queue, optimizer {}",
          task.getTaskId(), task.getOptimizingThread());

majin1102 avatar Mar 13 '24 15:03 majin1102

@majin1102 The logs I provided were found by the process id, so sadly there are no logs suggesting the task is removed. I double-checked the logs as you said and there was an optimizer expiring at 5 am, which is not relevant.

link3280 avatar Mar 14 '24 02:03 link3280

I'm trying to reproduce it again in the testing environment to get more information. Please let me know if you have any ideas. Thanks! @majin1102 @zhoujinsong

link3280 avatar Mar 14 '24 03:03 link3280

I've probably found the root cause. If a task fails and exhausts the max retries (default to 5), AMS will try to cancel all tasks (which causes TaskNotFoundException if not succeeds) and fail the process.

https://github.com/NetEase/amoro/blob/0e7d291955147545cbf88200087bfe572d8bbb9c/ams/server/src/main/java/com/netease/arctic/server/optimizing/OptimizingQueue.java#L446-L450

However, in my case, the cancellation process didn't work somehow, leaving the unfinished tasks at ACKED status and the process at RUNNING status.

+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+
| process_id    | task_id | table_id | catalog_name | db_name     | table_name             | status  | retry_num | start_time          | end_time            |
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+
| 1710087849397 |       1 | 53865906 | catalog      | db          | table                  | FAILED  |         6 | 2024-03-11 00:24:34 | 2024-03-11 00:24:37 |
| 1710087849397 |       2 | 53865906 | catalog      | db          | table                  | ACKED   |         0 | 2024-03-11 00:24:10 | NULL                |
| 1710087849397 |       3 | 53865906 | catalog      | db          | table                  | ACKED   |         0 | 2024-03-11 00:24:10 | NULL                |
| 1710087849397 |       4 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:37 |
| 1710087849397 |       5 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:35 |
| 1710087849397 |       6 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 |       7 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 |       8 | 53865906 | catalog      | db          | table                  | ACKED   |         0 | 2024-03-11 00:24:12 | NULL                |
| 1710087849397 |       9 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 |      10 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:13 | 2024-03-11 00:24:22 |
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+

In contrast, the previous failed process looks like this:

+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
| process_id    | task_id | table_id | catalog_name | db_name     | table_name             | status   | retry_num | start_time          | end_time            |
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
| 1710087488523 |       1 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 |
| 1710087488523 |       2 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:35 |
| 1710087488523 |       3 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:34 |
| 1710087488523 |       4 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 |
| 1710087488523 |       5 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:40 |
| 1710087488523 |       6 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:58 |
| 1710087488523 |       7 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:45 |
| 1710087488523 |       8 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:12 | 2024-03-11 00:18:38 |
| 1710087488523 |       9 | 53865906 | catalog      | db          | table                  | CANCELED |         6 | 2024-03-11 00:21:06 | 2024-03-11 00:21:19 |
| 1710087488523 |      10 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:13 | 2024-03-11 00:18:24 |
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+

link3280 avatar Mar 14 '24 11:03 link3280

I've probably found the root cause. If a task fails and exhausts the max retries (default to 5), AMS will try to cancel all tasks (which causes TaskNotFoundException if not succeeds) and fail the process.

https://github.com/NetEase/amoro/blob/0e7d291955147545cbf88200087bfe572d8bbb9c/ams/server/src/main/java/com/netease/arctic/server/optimizing/OptimizingQueue.java#L446-L450

However, in my case, the cancellation process didn't work somehow, leaving the unfinished tasks at ACKED status and the process at RUNNING status.

+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+
| process_id    | task_id | table_id | catalog_name | db_name     | table_name             | status  | retry_num | start_time          | end_time            |
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+
| 1710087849397 |       1 | 53865906 | catalog      | db          | table                  | FAILED  |         6 | 2024-03-11 00:24:34 | 2024-03-11 00:24:37 |
| 1710087849397 |       2 | 53865906 | catalog      | db          | table                  | ACKED   |         0 | 2024-03-11 00:24:10 | NULL                |
| 1710087849397 |       3 | 53865906 | catalog      | db          | table                  | ACKED   |         0 | 2024-03-11 00:24:10 | NULL                |
| 1710087849397 |       4 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:37 |
| 1710087849397 |       5 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:35 |
| 1710087849397 |       6 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 |       7 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 |       8 | 53865906 | catalog      | db          | table                  | ACKED   |         0 | 2024-03-11 00:24:12 | NULL                |
| 1710087849397 |       9 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 |      10 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:13 | 2024-03-11 00:24:22 |
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+

In contrast, the previous failed process looks like this:

+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
| process_id    | task_id | table_id | catalog_name | db_name     | table_name             | status   | retry_num | start_time          | end_time            |
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
| 1710087488523 |       1 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 |
| 1710087488523 |       2 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:35 |
| 1710087488523 |       3 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:34 |
| 1710087488523 |       4 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 |
| 1710087488523 |       5 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:40 |
| 1710087488523 |       6 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:58 |
| 1710087488523 |       7 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:45 |
| 1710087488523 |       8 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:12 | 2024-03-11 00:18:38 |
| 1710087488523 |       9 | 53865906 | catalog      | db          | table                  | CANCELED |         6 | 2024-03-11 00:21:06 | 2024-03-11 00:21:19 |
| 1710087488523 |      10 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:13 | 2024-03-11 00:18:24 |
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+

Hi, Link.

You really did a good job for reaching here. I've checked the code and found the underlying bug. The memory operation should follow persisting operation which is reversed in the code you posted

Bug I think you should be able to locate a log like 'LOG.error("accept result error:", e)' That will make sense finally and the issue should be very accidental when database encountered an error. Could you check the log again?

Thanks

majin1102 avatar Mar 14 '24 18:03 majin1102

@majin1102 Weird enough, I couldn't find the related logs you said.

I tried to reproduce it in the test environment, but AMS works as expected even after 30+ failed process. I assume it's related to the workload, as I have 100+ tables in the test environment but 10k in production.

FYI, I find our flink optimizers have a memory leak that's probably caused by glibc and get OOM killed sometimes. That may also contribute to the inconsistency problem.

link3280 avatar Mar 15 '24 09:03 link3280

I've probably found the root cause. If a task fails and exhausts the max retries (default to 5), AMS will try to cancel all tasks (which causes TaskNotFoundException if not succeeds) and fail the process.

https://github.com/NetEase/amoro/blob/0e7d291955147545cbf88200087bfe572d8bbb9c/ams/server/src/main/java/com/netease/arctic/server/optimizing/OptimizingQueue.java#L446-L450

However, in my case, the cancellation process didn't work somehow, leaving the unfinished tasks at ACKED status and the process at RUNNING status.

+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+
| process_id    | task_id | table_id | catalog_name | db_name     | table_name             | status  | retry_num | start_time          | end_time            |
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+
| 1710087849397 |       1 | 53865906 | catalog      | db          | table                  | FAILED  |         6 | 2024-03-11 00:24:34 | 2024-03-11 00:24:37 |
| 1710087849397 |       2 | 53865906 | catalog      | db          | table                  | ACKED   |         0 | 2024-03-11 00:24:10 | NULL                |
| 1710087849397 |       3 | 53865906 | catalog      | db          | table                  | ACKED   |         0 | 2024-03-11 00:24:10 | NULL                |
| 1710087849397 |       4 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:37 |
| 1710087849397 |       5 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:10 | 2024-03-11 00:24:35 |
| 1710087849397 |       6 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 |       7 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 |       8 | 53865906 | catalog      | db          | table                  | ACKED   |         0 | 2024-03-11 00:24:12 | NULL                |
| 1710087849397 |       9 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:11 | 2024-03-11 00:24:35 |
| 1710087849397 |      10 | 53865906 | catalog      | db          | table                  | SUCCESS |         0 | 2024-03-11 00:24:13 | 2024-03-11 00:24:22 |
+---------------+---------+----------+--------------+-------------+------------------------+---------+-----------+---------------------+---------------------+

In contrast, the previous failed process looks like this:

+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
| process_id    | task_id | table_id | catalog_name | db_name     | table_name             | status   | retry_num | start_time          | end_time            |
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
| 1710087488523 |       1 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 |
| 1710087488523 |       2 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:35 |
| 1710087488523 |       3 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:34 |
| 1710087488523 |       4 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:09 | 2024-03-11 00:18:37 |
| 1710087488523 |       5 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:40 |
| 1710087488523 |       6 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:58 |
| 1710087488523 |       7 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:10 | 2024-03-11 00:18:45 |
| 1710087488523 |       8 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:12 | 2024-03-11 00:18:38 |
| 1710087488523 |       9 | 53865906 | catalog      | db          | table                  | CANCELED |         6 | 2024-03-11 00:21:06 | 2024-03-11 00:21:19 |
| 1710087488523 |      10 | 53865906 | catalog      | db          | table                  | SUCCESS  |         0 | 2024-03-11 00:18:13 | 2024-03-11 00:18:24 |
+---------------+---------+----------+--------------+-------------+------------------------+----------+-----------+---------------------+---------------------+
 However, in my case, the cancellation process didn't work somehow, leaving the unfinished tasks at ACKED status and the process at RUNNING status.

You can check if there is a deadlock in jstack. I have also encountered the same problem.

rfyu avatar Mar 16 '24 10:03 rfyu

For me, this is why the state is not being updated.

"thrift-server-optimize-manager-20575" Id=646912 WAITING on java.util.concurrent.locks.ReentrantLock$NonfairSync@68e7d329 owned by "thrift-server-optimize-manager-20440" Id=641953
    at sun.misc.Unsafe.park(Native Method)
    -  waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@68e7d329
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at com.netease.arctic.server.persistence.StatedPersistentBase.invokeConsisitency(StatedPersistentBase.java:49)
    at com.netease.arctic.server.optimizing.TaskRuntime.tryCanceling(TaskRuntime.java:146)
    at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$1010/1409555188.accept(Unknown Source)
    at java.util.HashMap$Values.forEach(HashMap.java:981)
    at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.lambda$persistProcessCompleted$10(OptimizingQueue.java:604)
    at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess$$Lambda$1007/808873318.run(Unknown Source)
    at com.netease.arctic.server.persistence.PersistentBase$$Lambda$278/886902042.accept(Unknown Source)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
    at com.netease.arctic.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:61)
    at com.netease.arctic.server.optimizing.OptimizingQueue.access$400(OptimizingQueue.java:71)
    at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.persistProcessCompleted(OptimizingQueue.java:603)
    at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:437)
    at com.netease.arctic.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:103)
    at com.netease.arctic.server.optimizing.TaskRuntime$$Lambda$274/802047498.run(Unknown Source)
    at com.netease.arctic.server.persistence.PersistentBase$$Lambda$278/886902042.accept(Unknown Source)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
    at com.netease.arctic.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:61)
    at com.netease.arctic.server.persistence.StatedPersistentBase.invokeConsisitency(StatedPersistentBase.java:52)
    at com.netease.arctic.server.optimizing.TaskRuntime.complete(TaskRuntime.java:76)
    at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:230)
    at sun.reflect.GeneratedMethodAccessor225.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
    at com.sun.proxy.$Proxy48.completeTask(Unknown Source)
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583)
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
    at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138)
    at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524)
    at org.apache.thrift.server.Invocation.run(Invocation.java:18)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

"thrift-server-optimize-manager-20440" Id=641953 WAITING on java.util.concurrent.locks.ReentrantLock$NonfairSync@77281271 owned by "thrift-server-optimize-manager-20575" Id=646912
    at sun.misc.Unsafe.park(Native Method)
    -  waiting on java.util.concurrent.locks.ReentrantLock$NonfairSync@77281271
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at com.netease.arctic.server.optimizing.OptimizingQueue$TableOptimizingProcess.acceptResult(OptimizingQueue.java:407)
    at com.netease.arctic.server.optimizing.TaskRuntime.lambda$complete$0(TaskRuntime.java:103)
    at com.netease.arctic.server.optimizing.TaskRuntime$$Lambda$274/802047498.run(Unknown Source)
    at com.netease.arctic.server.persistence.PersistentBase$$Lambda$278/886902042.accept(Unknown Source)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
    at com.netease.arctic.server.persistence.PersistentBase.doAsTransaction(PersistentBase.java:61)
    at com.netease.arctic.server.persistence.StatedPersistentBase.invokeConsisitency(StatedPersistentBase.java:52)
    at com.netease.arctic.server.optimizing.TaskRuntime.complete(TaskRuntime.java:76)
    at com.netease.arctic.server.DefaultOptimizingService.completeTask(DefaultOptimizingService.java:230)
    at sun.reflect.GeneratedMethodAccessor225.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.netease.arctic.server.utils.ThriftServiceProxy.invoke(ThriftServiceProxy.java:56)
    at com.sun.proxy.$Proxy48.completeTask(Unknown Source)
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:583)
    at com.netease.arctic.ams.api.OptimizingService$Processor$completeTask.getResult(OptimizingService.java:562)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
    at org.apache.thrift.TMultiplexedProcessor.process(TMultiplexedProcessor.java:138)
    at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524)
    at org.apache.thrift.server.Invocation.run(Invocation.java:18)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

rfyu avatar Mar 16 '24 10:03 rfyu

@rfyu Thanks a lot for your pointer!

link3280 avatar Mar 16 '24 16:03 link3280

"thrift-server-optimize-manager-20575" Id=646912 WAITING o

This is really a big problem which has never been considered.

Do you have any ideas to solve this?

majin1102 avatar Mar 16 '24 17:03 majin1102

We could add a timeout for task cancelation. When it times out, the thread accepting the current result should release the task runtime lock and skip the cancelation, to clear the path for the other threads to accept a newer result.

link3280 avatar Mar 16 '24 17:03 link3280

Can we check the current status of a task before canceling it, and only cancel tasks that can be canceled, instead of canceling all tasks? image

rfyu avatar Mar 17 '24 02:03 rfyu

@rfyu I'm afraid it can't completely solve the problem, as the task condition can change after the check and before the cancelation happens.

link3280 avatar Mar 17 '24 03:03 link3280

Please take a look at https://github.com/NetEase/amoro/pull/2644 @majin1102 @rfyu

link3280 avatar Mar 17 '24 04:03 link3280

@rfyu I'm afraid it can't completely solve the problem, as the task condition can change after the check and before the cancelation happens.

You are right, I didn't consider it carefully

rfyu avatar Mar 17 '24 05:03 rfyu

I found a related issue with this: #2380.

We may need to resolve it ASAP.

zhoujinsong avatar Mar 18 '24 03:03 zhoujinsong

We may need to resolve it ASAP.

Agreed. Please help review https://github.com/NetEase/amoro/pull/2644 when available.

link3280 avatar Mar 18 '24 06:03 link3280

We could add a timeout for task cancelation. When it times out, the thread accepting the current result should release the task runtime lock and skip the cancelation, to clear the path for the other threads to accept a newer result.

Thanks for your idea.

In my opinion, this is a little bit complex. I suggest considering to remove lock from TaskRuntime cause it already has a CAS lock(status.accept(xxx)). How about take a try?

majin1102 avatar Mar 18 '24 17:03 majin1102

@link3280 Thank you very much for continuously driving the resolution of this issue. I have read the codes in #2644. I understand your core idea is to downgrade the cancel operation for all tasks, allowing them to fail to avoid deadlock. I believe this idea indeed addresses the problem, but it does introduce complexity to the code and may not be as straightforward to understand.

So I tried to find more solutions, and now I have a new approach: The task completion operation is very special. It not only changes the state of the task but also triggers the checking and changing of the process state. Therefore, it first acquires its own persistent lock before acquiring the process lock. However, if we change the order of lock acquisition, it seems we can effectively avoid deadlock. Specifically, acquiring the process lock first, then acquiring the task persistent lock.

Translating this into the current codebase, we can directly call TableOptimizingProcess#acceptResult method in the TaskRuntime#complete function. Inside it, we can first acquire the process lock, then proceed with the checking and state changes for both task and process.

HDYT? @link3280 @majin1102 @rfyu

zhoujinsong avatar Mar 19 '24 01:03 zhoujinsong

I suggest considering to remove lock from TaskRuntime cause it already has a CAS lock(status.accept(xxx)). How about take a try?

The task runtime persistent lock is managed by the persistent framework. I think it is not easy to be removed.

zhoujinsong avatar Mar 19 '24 01:03 zhoujinsong

We could add a timeout for task cancelation. When it times out, the thread accepting the current result should release the task runtime lock and skip the cancelation, to clear the path for the other threads to accept a newer result.

Thanks for your idea.

In my opinion, this is a little bit complex. I suggest considering to remove lock from TaskRuntime cause it already has a CAS lock(status.accept(xxx)). How about take a try?

I'd like to take the CAS approach if it's a project from scratch, but I'm afraid it would break the current persistent layer as @zhoujinsong says. Nonetheless, I would have a try.

link3280 avatar Mar 19 '24 01:03 link3280

@link3280 Thank you very much for continuously driving the resolution of this issue. I have read the codes in #2644. I understand your core idea is to downgrade the cancel operation for all tasks, allowing them to fail to avoid deadlock. I believe this idea indeed addresses the problem, but it does introduce complexity to the code and may not be as straightforward to understand.

So I tried to find more solutions, and now I have a new approach: The task completion operation is very special. It not only changes the state of the task but also triggers the checking and changing of the process state. Therefore, it first acquires its own persistent lock before acquiring the process lock. However, if we change the order of lock acquisition, it seems we can effectively avoid deadlock. Specifically, acquiring the process lock first, then acquiring the task persistent lock.

Translating this into the current codebase, we can directly call TableOptimizingProcess#acceptResult method in the TaskRuntime#complete function. Inside it, we can first acquire the process lock, then proceed with the checking and state changes for both task and process.

HDYT? @link3280 @majin1102 @rfyu

@zhoujinsong I've thought about changing the lock acquisition order. The downside is that it causes a bigger footprint on the locks and harms the performance.

Normally, at least 99.9% of the tasks (either in a successful or failed process) would succeed and commit their results. Currently, they're doing it independently and no locks are shared. If they need to acquire a lock on the process, then the blocking commit time would be longer, especially for large tables with lots of optimizers. It may not be worth sacrificing the performance of general scenarios to ensure a cleanup operation that is a secondary priority, as it could be aborted after a timeout.

WDYT?

link3280 avatar Mar 19 '24 02:03 link3280

@link3280 Thank you very much for continuously driving the resolution of this issue. I have read the codes in #2644. I understand your core idea is to downgrade the cancel operation for all tasks, allowing them to fail to avoid deadlock. I believe this idea indeed addresses the problem, but it does introduce complexity to the code and may not be as straightforward to understand. So I tried to find more solutions, and now I have a new approach: The task completion operation is very special. It not only changes the state of the task but also triggers the checking and changing of the process state. Therefore, it first acquires its own persistent lock before acquiring the process lock. However, if we change the order of lock acquisition, it seems we can effectively avoid deadlock. Specifically, acquiring the process lock first, then acquiring the task persistent lock. Translating this into the current codebase, we can directly call TableOptimizingProcess#acceptResult method in the TaskRuntime#complete function. Inside it, we can first acquire the process lock, then proceed with the checking and state changes for both task and process. HDYT? @link3280 @majin1102 @rfyu

@zhoujinsong I've thought about changing the lock acquisition order. The downside is that it causes a bigger footprint on the locks and harms the performance.

Normally, at least 99.9% of the tasks (either in a successful or failed process) would succeed and commit their results. Currently, they're doing it independently and no locks are shared. If they need to acquire a lock on the process, then the blocking commit time would be longer, especially for large tables with lots of optimizers. WDYT?

I think you are right about this.

Let me talk more about my thought: I propose creating a CAS state lock (based on thread ID) version of invokeConsistency() for TaskRuntime. When the CAS operation returns false, we can throw a concurrency exception. This approach adheres to a non-lock design, assuming that concurrency should not occur here or is considered illegal. Given that TaskRuntime already has a CAS lock, this solution seems quite reasonable.

In any case, let’s fix this in a proper manner.

majin1102 avatar Mar 19 '24 02:03 majin1102