Task has been waiting to be executed for more than 12 hours and does not seem to time out

Open epitomizelu opened this issue 1 year ago • 8 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

It looks the same as bug 7441. The version is 3.1.8, and the cluster has 2 masters and 4 workers. I found a workflow instance that had been running for more than 12 hours, which is abnormal. I then found that a task in the sub-workflow had been waiting to be executed for more than 10 hours. When I stop the workflow and restart it, the problem usually does not reproduce.

What you expected to happen

If a task waits for more than 5 minutes and cannot be executed, the task should fail.

How to reproduce

It is hard to reproduce; the workflow works normally most of the time. When I stop the workflow and restart it, the problem usually does not reproduce.

Anything else

No response

Version

3.1.x

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

epitomizelu avatar Feb 26 '24 07:02 epitomizelu

Please provide the master log related to the parent workflow and sub-workflow instances.

ruanwenjun avatar Feb 26 '24 13:02 ruanwenjun

Hi @epitomizelu, #14986 may fix your problem. The task may have actually been executed, but it still shows as running because of a message backlog for subprocess-type tasks. cc @ruanwenjun WDYT?

fuchanghai avatar Feb 27 '24 01:02 fuchanghai

We may need to add metrics to record the size of xxCheckList in StateWheelExecuteThread.
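As a rough illustration of that idea, here is a minimal Java sketch of gauges that report the current size of each check list. It is not DolphinScheduler code: the class `CheckListSizeGauges`, its methods, and the list name `taskTimeoutCheckList` are hypothetical, and the sketch uses only the JDK rather than whatever metrics library the project actually uses.

```java
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.IntSupplier;

// Hypothetical helper (not part of DolphinScheduler): exposes the live size
// of named check lists so a backlog becomes visible.
class CheckListSizeGauges {
    private final Map<String, IntSupplier> gauges = new ConcurrentHashMap<>();

    // Register a gauge that reads the collection's size each time it is sampled.
    void register(String name, Collection<?> checkList) {
        gauges.put(name, checkList::size);
    }

    // Sample every gauge, e.g. from a periodic reporter; sorted by name for stable output.
    Map<String, Integer> snapshot() {
        Map<String, Integer> sizes = new TreeMap<>();
        gauges.forEach((name, gauge) -> sizes.put(name, gauge.getAsInt()));
        return sizes;
    }

    public static void main(String[] args) {
        // Assumed stand-in for one of the "xxCheckList" collections.
        ConcurrentLinkedQueue<String> taskTimeoutCheckList = new ConcurrentLinkedQueue<>();
        CheckListSizeGauges gauges = new CheckListSizeGauges();
        gauges.register("taskTimeoutCheckList", taskTimeoutCheckList);

        taskTimeoutCheckList.add("task-1");
        taskTimeoutCheckList.add("task-2");
        System.out.println(gauges.snapshot());
    }
}
```

Because each gauge reads the size on demand, a list that keeps growing between samples would point to the kind of backlog discussed here.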

ruanwenjun avatar Feb 27 '24 02:02 ruanwenjun

+1

fuchanghai avatar Feb 27 '24 06:02 fuchanghai

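@ruanwenjun Hello, when using a task with a subprocess, the main workflow stays in a waiting state once it reaches the subprocess. The child node has actually finished, but the main task status still shows as running. The screenshots below show the task execution status and the cluster deployment. Looking forward to your reply. image image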

leachli avatar Feb 27 '24 08:02 leachli

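Looking at this screenshot, the sub-workflow logic node has already succeeded? Which version is this? If it is the current 3.2.x, this kind of problem should no longer occur. In 3.2.x the sub-workflow node checks status by pulling, which avoids the earlier problem where a failed push left the task status not updated.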
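To make the pull approach concrete, here is a minimal Java sketch of a parent node polling a sub-workflow's state until it reaches a terminal state. It is an illustration only, not the 3.2.x implementation: `SubWorkflowState`, `SubWorkflowStatusPoller`, and the `statusSource` supplier are hypothetical names.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical states; the real project has its own execution status types.
enum SubWorkflowState { RUNNING, SUCCESS, FAILURE }

class SubWorkflowStatusPoller {
    // Pull the state from statusSource until it is terminal or maxPolls is reached.
    // Returns the last observed state (RUNNING if nothing was polled or it never finished).
    static SubWorkflowState pollUntilFinished(Supplier<SubWorkflowState> statusSource,
                                              int maxPolls, long intervalMillis) {
        SubWorkflowState state = SubWorkflowState.RUNNING;
        for (int i = 0; i < maxPolls; i++) {
            state = statusSource.get();
            if (state != SubWorkflowState.RUNNING) {
                return state; // terminal state observed, no push notification needed
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return state;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        // Assumed stand-in for a database or API lookup of the sub-workflow state.
        Iterator<SubWorkflowState> states = List.of(
                SubWorkflowState.RUNNING, SubWorkflowState.RUNNING, SubWorkflowState.SUCCESS).iterator();
        System.out.println(pollUntilFinished(states::next, 5, 10L));
    }
}
```

With polling, a single lost notification cannot leave the parent stuck, since the next poll reads the current state again; the trade-off is the extra lookups and a delay of up to one polling interval.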

ruanwenjun avatar Feb 28 '24 10:02 ruanwenjun

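@ruanwenjun Hello, in 3.2.x, when a workflow contains a sub-workflow and a task inside the sub-workflow fails, clicking "rerun from the failed node" also starts all the tasks in the sub-workflow that had already succeeded. Have you run into a similar problem?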

maxiangmin avatar Mar 28 '24 02:03 maxiangmin

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Apr 28 '24 00:04 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar May 06 '24 00:05 github-actions[bot]