Paul Lin

69 comments by Paul Lin

I'm trying to reproduce it again in the testing environment to get more information. Please let me know if you have any ideas. Thanks! @majin1102 @zhoujinsong

I've probably found the root cause. If a task fails and exhausts the max retries (5 by default), AMS will try to cancel all tasks (which causes TaskNotFoundException if not...

@majin1102 Oddly enough, I couldn't find the logs you mentioned. I tried to reproduce it in the test environment, but AMS works as expected even after 30+ failed processes....

@rfyu Thanks a lot for your pointer!

We could add a timeout for task cancelation. When it times out, the thread accepting the current result should release the task runtime lock and skip the cancelation, to clear...
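
A minimal sketch of the timeout idea, for illustration only. It is not taken from AMS or from any PR: the class `CancelWithTimeoutSketch`, the method `acceptResult`, and the use of a `ReentrantLock` as the "task runtime lock" are all assumptions. The cancellation runs on a helper thread, and the thread accepting the result waits for it only up to a deadline, then gives up and releases the lock.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

public class CancelWithTimeoutSketch {

  // Hypothetical stand-in for the task runtime lock mentioned above.
  private final ReentrantLock taskLock = new ReentrantLock();

  // Helper thread that performs the (possibly slow) cancellation.
  private final ExecutorService canceler =
      Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "cancel-sketch");
        t.setDaemon(true);
        return t;
      });

  /**
   * Runs the cancellation while holding the lock, but waits at most
   * timeoutMillis for it. Returns true if the cancellation finished in
   * time, false if it was skipped. The lock is released in both cases.
   */
  public boolean acceptResult(Runnable cancelOtherTasks, long timeoutMillis) {
    taskLock.lock();
    try {
      Future<?> pending = canceler.submit(cancelOtherTasks);
      try {
        pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        // Give up: interrupt the helper and skip the cancellation.
        pending.cancel(true);
        return false;
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        pending.cancel(true);
        return false;
      } catch (ExecutionException e) {
        // The cancellation itself failed; treat it as skipped.
        return false;
      }
    } finally {
      // Always release the lock so other threads are not blocked.
      taskLock.unlock();
    }
  }

  /** Exposed only so the sketch can be checked. */
  public boolean isLocked() {
    return taskLock.isLocked();
  }
}
```

In this sketch a timed-out cancellation is simply dropped; whether it should be retried later is a separate design decision.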

@rfyu I'm afraid it can't completely solve the problem, as the task condition can change after the check and before the cancelation happens.
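
To illustrate the race described above, here is a small sketch with hypothetical names (`CheckThenCancelSketch`, `Status`); it does not reflect how AMS stores task state. `cancelRacy` checks and then acts in two steps, so the status can change in between. `cancelAtomic` does both in a single compare-and-set, which only closes the gap when the state lives in one process.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CheckThenCancelSketch {

  // Hypothetical task states, for illustration only.
  public enum Status { ACKED, SUCCESS, CANCELED }

  private final AtomicReference<Status> status =
      new AtomicReference<>(Status.ACKED);

  /** Check-then-act: the status may change between get() and set(). */
  public boolean cancelRacy() {
    if (status.get() == Status.ACKED) {
      // Another thread could call complete() right here.
      status.set(Status.CANCELED);
      return true;
    }
    return false;
  }

  /** Single atomic step: cancels only if the task is still ACKED. */
  public boolean cancelAtomic() {
    return status.compareAndSet(Status.ACKED, Status.CANCELED);
  }

  /** Marks the task finished, unless it was already canceled. */
  public boolean complete() {
    return status.compareAndSet(Status.ACKED, Status.SUCCESS);
  }

  public Status status() {
    return status.get();
  }
}
```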

Please take a look at https://github.com/NetEase/amoro/pull/2644 @majin1102 @rfyu

> We may need to resolve it ASAP.

Agreed. Please help review https://github.com/NetEase/amoro/pull/2644 when available.

> > We could add a timeout for task cancelation. When it times out, the thread accepting the current result should release the task runtime lock and skip the cancelation,...

> @link3280 Thank you very much for continuously driving the resolution of this issue. I have read the code in #2644. I understand your core idea is to downgrade the...