dolphinscheduler [Bug] [MASTER] When Serial Wait workflows are frequently scheduled, serially waiting workflows might get stuck

Search before asking

[X] I had searched in the issues and found no similar issues.

What happened

I think the state synchronization mechanism has critical errors. State transition is a process, but currently the notification mechanism between processes relies on instantaneous states. For example, suppose the workflows also have timeout configurations: if a workflow times out, the next serially waiting workflow might get stuck.

@startuml
StateWheelExecuteThread --> WorkflowTimeoutStateEventHandler: send process A PROCESS_TIMEOUT event
WorkflowTimeoutStateEventHandler -> WorkflowExecuteRunnable: processTimeout
WorkflowExecuteRunnable --> WorkflowStateEventHandler: send process A STOP event
WorkflowStateEventHandler -> WorkflowExecuteRunnable: endProcess
WorkflowExecuteRunnable -> WorkflowExecuteRunnable:checkSerialProcess
WorkflowExecuteRunnable --> ProcessServiceImpl: send process B RECOVER_SERIAL_WAIT command
ProcessServiceImpl -> ProcessServiceImpl: STEP 1: handleCommand(if state of process A is RUNNING_PROCESS_STATE, state of process B will change back to SERIAL_WAIT)
WorkflowStateEventHandler -> WorkflowStateEventHandler: STEP 2: update process A to STOP
@enduml

Since the order of STEP 1 and STEP 2 cannot currently be guaranteed, all subsequent instances might get stuck in the "serial wait" state.

What you expected to happen

Before resuming subsequent instances, a workflow instance needs to ensure that its own state has been fully updated.
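
The ordering problem can be illustrated with a minimal, single-threaded sketch. All class and method names below are hypothetical and do not correspond to DolphinScheduler's actual code; the map simply stands in for the persisted workflow instance state. In the real system the interleaving is nondeterministic, so the first method only reproduces the unlucky order in which STEP 1 runs before STEP 2.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative and not DolphinScheduler's API.
class SerialFinishSketch {
    enum State { RUNNING, STOP, SERIAL_WAIT, SUBMITTED_SUCCESS }

    // Stands in for the persisted workflow instance table.
    final Map<String, State> store = new HashMap<>();

    // Models STEP 1: handling RECOVER_SERIAL_WAIT for the waiting instance.
    // The waiter is only submitted if the previous instance is no longer running.
    void handleRecover(String previous, String waiter) {
        if (store.get(previous) == State.RUNNING) {
            store.put(waiter, State.SERIAL_WAIT);       // falls back to waiting
        } else {
            store.put(waiter, State.SUBMITTED_SUCCESS); // allowed to start
        }
    }

    // Problematic order: the recover command is handled before the
    // finishing instance has persisted its own final state (STEP 2).
    void finishNotifyFirst(String self, String waiter) {
        handleRecover(self, waiter);
        store.put(self, State.STOP);
    }

    // Expected order: persist the final state first, then notify.
    void finishPersistFirst(String self, String waiter) {
        store.put(self, State.STOP);
        handleRecover(self, waiter);
    }
}
```

With `finishNotifyFirst`, instance B is put back into SERIAL_WAIT and nothing wakes it up again; with `finishPersistFirst`, B is submitted.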

How to reproduce

Create Workflow A and Workflow B, both scheduled to run every minute, and use Serial Wait execution type.
In Workflow A, create a SUB_PROCESS task node that references Workflow B.
Online both Workflow A and Workflow B, and online the schedules.
Observe the state changes.

Anything else

No response

Version

3.1.1

Are you willing to submit PR?

[] Yes I am willing to submit a PR!

Code of Conduct

[] I agree to follow this project's Code of Conduct

Jul 25 '24 06:07 CloudSen

Search before asking

[X] I had searched in the issues and found no similar issues.

What happened

State transition is a process, but currently the notification mechanism between processes relies on instantaneous states. For example, suppose the workflows also have timeout configurations: if a workflow times out, the next serially waiting workflow might get stuck.

@startuml
StateWheelExecuteThread --> WorkflowTimeoutStateEventHandler: send process A PROCESS_TIMEOUT event
WorkflowTimeoutStateEventHandler -> WorkflowExecuteRunnable: processTimeout
WorkflowExecuteRunnable --> WorkflowStateEventHandler: send process A STOP event
WorkflowStateEventHandler -> WorkflowExecuteRunnable: endProcess
WorkflowExecuteRunnable -> WorkflowExecuteRunnable:checkSerialProcess
WorkflowExecuteRunnable --> ProcessServiceImpl: send process B RECOVER_SERIAL_WAIT command
ProcessServiceImpl -> ProcessServiceImpl: STEP 1: handleCommand(if state of process A is RUNNING_PROCESS_STATE, state of process B will change back to SERIAL_WAIT)
WorkflowStateEventHandler -> WorkflowStateEventHandler: STEP 2: update process A to STOP
@enduml

Since the order of STEP 1 and STEP 2 cannot currently be guaranteed, all subsequent instances might get stuck in the "serial wait" state.

What you expected to happen

Before resuming subsequent instances, a workflow instance needs to ensure that its own state has been fully updated.

How to reproduce

Create Workflow A and Workflow B, both scheduled to run every minute.
In Workflow A, create a SUB_PROCESS task node that references Workflow B.
Online both Workflow A and Workflow B, and online the schedules.
Observe the state changes.

Anything else

No response

Version

dev

Are you willing to submit PR?

[X] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Jul 25 '24 06:07 github-actions[bot]

cc @ruanwenjun

Jul 25 '24 07:07 SbloodyS

Maybe we need a cache queue for the RECOVER_SERIAL_WAIT command? If the command execution finds that the previous workflow has not been updated to a completed state in time, it should be put back in the queue and checked again after a while. When the previous workflow state has been fully updated, this command will successfully consume from the queue and transition from the SERIAL_WAIT state to the SUBMITTED_SUCCESS state.
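
A rough sketch of that idea, under stated assumptions: the class, its methods, and the `previousFinished` predicate are all hypothetical, and the predicate stands in for whatever lookup reports that the previous instance's final state has been persisted. A periodic task would call `drainOnce()`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical sketch of a retry queue for RECOVER_SERIAL_WAIT commands.
class RecoverSerialWaitQueue {
    private static final class Command {
        final int previousInstanceId;
        final int waitingInstanceId;
        Command(int previousInstanceId, int waitingInstanceId) {
            this.previousInstanceId = previousInstanceId;
            this.waitingInstanceId = waitingInstanceId;
        }
    }

    private final Deque<Command> pending = new ArrayDeque<>();
    private final List<Integer> submitted = new ArrayList<>();
    // Assumption: returns true once the previous instance's final state is persisted.
    private final IntPredicate previousFinished;

    RecoverSerialWaitQueue(IntPredicate previousFinished) {
        this.previousFinished = previousFinished;
    }

    synchronized void offer(int previousInstanceId, int waitingInstanceId) {
        pending.addLast(new Command(previousInstanceId, waitingInstanceId));
    }

    // One pass over the queue; returns how many waiters were released.
    synchronized int drainOnce() {
        int released = 0;
        int size = pending.size();
        for (int i = 0; i < size; i++) {
            Command c = pending.pollFirst();
            if (previousFinished.test(c.previousInstanceId)) {
                submitted.add(c.waitingInstanceId); // SERIAL_WAIT -> SUBMITTED_SUCCESS
                released++;
            } else {
                pending.addLast(c); // not ready yet: put back and check again later
            }
        }
        return released;
    }

    synchronized int pendingCount() { return pending.size(); }

    synchronized List<Integer> submitted() { return new ArrayList<>(submitted); }
}
```

A command is only dropped from the queue once the previous instance is seen as finished, so a late state update delays the waiter instead of leaving it stuck.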

Jul 26 '24 01:07 CloudSen

+1

Aug 07 '24 03:08 ZhaoRuidong

Currently, I resolved this issue by retrying the RECOVER_SERIAL_WAIT command when the previous workflow was not fully updated, and it works well.

Aug 09 '24 05:08 CloudSen

Right now the serial wait implementation is really unstable; there are a lot of cases in which it doesn't work well, e.g.

Concurrent triggers can cause multiple workflow instances to run when they should be in serial wait.
A failed notification might leave the origin workflow instance unable to finish.
The workflow itself has to handle the notify logic, which makes the workflow instance state transitions more complex.

It would be better to refactor this and use a global SerialWaitCoordinator to notify the serial wait workflow instances, so that the origin workflow instance doesn't need to care whether it has to send a notification.
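
One possible shape for such a coordinator, as a single-JVM sketch only. SerialWaitCoordinator is the name proposed above, but every method, field, and parameter here is an assumption; a real multi-master deployment would presumably need the same bookkeeping backed by the database or the registry rather than in-memory maps.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical in-memory sketch; not an existing DolphinScheduler class.
class SerialWaitCoordinatorSketch {
    // Per workflow definition: the instance currently allowed to run.
    private final Map<Long, Integer> running = new HashMap<>();
    // Per workflow definition: instances in SERIAL_WAIT, in arrival order.
    private final Map<Long, Deque<Integer>> waiting = new HashMap<>();

    // Returns true if the instance may start now; otherwise it is queued.
    synchronized boolean register(long definitionCode, int instanceId) {
        if (!running.containsKey(definitionCode)) {
            running.put(definitionCode, instanceId);
            return true;
        }
        waiting.computeIfAbsent(definitionCode, k -> new ArrayDeque<>())
               .addLast(instanceId);
        return false;
    }

    // Called after the instance's final state has been persisted.
    // Returns the next instance to wake up, if any.
    synchronized OptionalInt onFinished(long definitionCode, int instanceId) {
        Integer current = running.get(definitionCode);
        if (current == null || current != instanceId) {
            return OptionalInt.empty(); // stale or duplicate notification
        }
        running.remove(definitionCode);
        Deque<Integer> queue = waiting.get(definitionCode);
        if (queue == null || queue.isEmpty()) {
            return OptionalInt.empty();
        }
        int next = queue.pollFirst();
        running.put(definitionCode, next);
        return OptionalInt.of(next);
    }
}
```

Because the coordinator owns both the running slot and the wait queue, the finishing workflow instance only reports that it is done and never decides on its own which instance to notify.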

Aug 18 '24 14:08 ruanwenjun

I agree. I have also thought about a coordinator similar to SerialWaitCoordinator, instead of having the workflow instances send the notifications themselves.

Is there a refactoring plan for this?
How should this bug in 3.1.x be fixed?

Retrying the RECOVER_SERIAL_WAIT command can only resolve instances where the next workflow state is already SERIAL_WAIT. However, it does not address cases where the next workflow instance is transitioning into a SERIAL_WAIT state.

Aug 19 '24 05:08 CloudSen

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

Sep 20 '24 00:09 github-actions[bot]

In version 3.2.1, I fixed a similar issue, which might be helpful for the current issue. You can refer to it. @CloudSen

#15270

Sep 20 '24 13:09 Gallardot

@Gallardot Yes, I’ve already applied patch #15270. But as I mentioned earlier, the state change is always a process. Opening a new transaction only speeds up the state update, but the same issue still exists in concurrent scenarios. A synchronizer is still needed for serially running workflows.

Sep 22 '24 08:09 CloudSen

@Gallardot Here’s another example for saveSerialProcess: In concurrent scenarios (multiple parent workflows with the same scheduling cycle reference the same sub-workflow), multiple start_workflow commands will be consumed by different masters at the same time. Then, in saveSerialProcess, each master will simultaneously detect that there are no running instances (the other transactions have not been committed yet), and multiple start events for the workflow will be triggered at the same time. This ultimately leads to multiple workflow instances running concurrently, which is incorrect when they are supposed to be processed serially.
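
The underlying problem is a check-then-act race: the "is anything running?" check and the "start" action are not one atomic step. As a minimal single-JVM sketch with hypothetical names, the claim can be made atomic with `ConcurrentHashMap.putIfAbsent`; across several masters the equivalent would have to be something like a database unique constraint or a distributed lock, which this sketch does not model.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; not DolphinScheduler's saveSerialProcess.
class SerialStartGuardSketch {
    // Workflow definition code -> instance that currently holds the running slot.
    private final Map<Long, Integer> runningByDefinition = new ConcurrentHashMap<>();

    // Check and claim in one atomic step: exactly one concurrent caller
    // per definition gets true; the others should go to SERIAL_WAIT.
    boolean tryStart(long definitionCode, int instanceId) {
        return runningByDefinition.putIfAbsent(definitionCode, instanceId) == null;
    }

    // Releases the slot only if this instance is the one holding it.
    void finish(long definitionCode, int instanceId) {
        runningByDefinition.remove(definitionCode, instanceId);
    }
}
```

However many callers race on `tryStart` for the same definition, only one of them wins, which is the property the separate check and start steps lose under concurrency.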

Sep 22 '24 09:09 CloudSen

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

Oct 23 '24 00:10 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

Oct 30 '24 00:10 github-actions[bot]

@CloudSen Do you already have any workaround to solve this issue?

Feb 05 '25 09:02 hujian0401

This issue isn't fixed, so I'm reopening it.

Feb 06 '25 09:02 ruanwenjun

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

Mar 09 '25 00:03 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

Mar 16 '25 00:03 github-actions[bot]
