Paul Lin
I'm trying to reproduce it again in the testing environment to get more information. Please let me know if you have any ideas. Thanks! @majin1102 @zhoujinsong
I've probably found the root cause. If a task fails and exhausts the max retries (5 by default), AMS will try to cancel all tasks (which causes TaskNotFoundException if not...
@majin1102 Weirdly enough, I couldn't find the related logs you mentioned. I tried to reproduce it in the test environment, but AMS works as expected even after 30+ failed process....
@rfyu Thanks a lot for your pointer!
We could add a timeout for task cancelation. When it times out, the thread accepting the current result should release the task runtime lock and skip the cancelation, to clear...
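To make the timeout idea concrete, here is a minimal sketch in plain Java. It is not the AMS code and not what the PR does: `CancelWithTimeoutSketch`, `tryCancel`, and `demo` are hypothetical names, and a plain `ReentrantLock` stands in for the task runtime lock. The only assumption is that cancelation has to acquire a lock that another thread may be holding, so it waits a bounded time and skips the cancelation if the lock is not available.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: all names here are illustrative, not AMS classes.
public class CancelWithTimeoutSketch {

  /**
   * Tries to run cancelAction under targetLock, waiting at most the given timeout
   * for the lock. Returns true if the action ran, false if the cancelation was skipped.
   */
  static boolean tryCancel(ReentrantLock targetLock, Runnable cancelAction,
                           long timeout, TimeUnit unit) throws InterruptedException {
    if (!targetLock.tryLock(timeout, unit)) {
      return false; // timed out: skip the cancelation instead of blocking forever
    }
    try {
      cancelAction.run();
      return true;
    } finally {
      targetLock.unlock();
    }
  }

  /** Returns {result with a free lock, result with a lock held by another thread}. */
  static boolean[] demo() {
    try {
      AtomicBoolean cancelled = new AtomicBoolean(false);

      // Case 1: nobody holds the lock, so the cancelation goes through.
      boolean free = tryCancel(new ReentrantLock(), () -> cancelled.set(true),
          100, TimeUnit.MILLISECONDS);

      // Case 2: another thread holds the lock for longer than the timeout.
      ReentrantLock busyLock = new ReentrantLock();
      CountDownLatch held = new CountDownLatch(1);
      CountDownLatch release = new CountDownLatch(1);
      Thread holder = new Thread(() -> {
        busyLock.lock();
        try {
          held.countDown();
          release.await(); // keep the lock until the demo is done
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          busyLock.unlock();
        }
      });
      holder.start();
      held.await();
      boolean busy = tryCancel(busyLock, () -> { }, 100, TimeUnit.MILLISECONDS);
      release.countDown();
      holder.join();

      return new boolean[] {free && cancelled.get(), busy};
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] results = demo();
    System.out.println("free lock cancelled: " + results[0]);
    System.out.println("busy lock cancelled: " + results[1]);
  }
}
```

In this sketch the caller never blocks longer than the timeout, so it can always go on to release whatever lock it holds itself. The trade-off is that a skipped cancelation leaves the other task running and something else has to clean it up later.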
@rfyu I'm afraid it can't completely solve the problem, as the task condition can change after the check and before the cancelation happens.
Please take a look at https://github.com/NetEase/amoro/pull/2644 @majin1102 @rfyu
> We may need to resolve it ASAP. Agreed. Please help review https://github.com/NetEase/amoro/pull/2644 when available.
> > We could add a timeout for task cancelation. When it times out, the thread accepting the current result should release the task runtime lock and skip the cancelation,...
> @link3280 Thank you very much for continuously driving the resolution of this issue. I have read the code in #2644. I understand your core idea is to downgrade the...