[Bug] [MASTER] When Serial Wait workflows are frequently scheduled, serially waiting workflows might get stuck

Open CloudSen opened this issue 1 year ago • 11 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

I think the state synchronization mechanism has a critical flaw. State transition is a process, but the notification mechanism between processes currently relies on instantaneous states. For example, suppose the workflows also have timeout configurations. In that scenario, if a workflow times out, the next serially waiting workflow might get stuck.

@startuml
StateWheelExecuteThread --> WorkflowTimeoutStateEventHandler: send process A PROCESS_TIMEOUT event
WorkflowTimeoutStateEventHandler -> WorkflowExecuteRunnable: processTimeout
WorkflowExecuteRunnable --> WorkflowStateEventHandler: send process A STOP event
WorkflowStateEventHandler -> WorkflowExecuteRunnable: endProcess
WorkflowExecuteRunnable -> WorkflowExecuteRunnable:checkSerialProcess
WorkflowExecuteRunnable --> ProcessServiceImpl: send process B RECOVER_SERIAL_WAIT command
ProcessServiceImpl -> ProcessServiceImpl: STEP 1: handleCommand(if state of process A is RUNNING_PROCESS_STATE, state of process B will change back to SERIAL_WAIT)
WorkflowStateEventHandler -> WorkflowStateEventHandler: STEP 2: update process A to STOP
@enduml

image

Since the order of STEP 1 and STEP 2 cannot be guaranteed, all subsequent instances might get stuck in the "serial wait" state.
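The ordering problem can be reproduced in a few lines. The following is only a simplified model of the two steps, not the actual DolphinScheduler code: the class `SerialWaitRaceDemo`, the `State` enum, and both method names are hypothetical, and the real handlers run on separate threads rather than in a fixed order.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, simplified model of the race; these are not DolphinScheduler's real classes.
class SerialWaitRaceDemo {
    enum State { RUNNING_EXECUTION, STOP, SERIAL_WAIT, SUBMITTED_SUCCESS }

    // Assumption: a single "running" state is enough to illustrate the check.
    static final Set<State> RUNNING_STATES = EnumSet.of(State.RUNNING_EXECUTION);

    // STEP 1: handle RECOVER_SERIAL_WAIT for B by inspecting the persisted state of A.
    static State handleRecoverSerialWait(State persistedStateOfA) {
        return RUNNING_STATES.contains(persistedStateOfA)
                ? State.SERIAL_WAIT          // A still looks running, so B goes back to waiting
                : State.SUBMITTED_SUCCESS;   // A is finished, so B may start
    }

    // Runs STEP 1 and STEP 2 in the requested order and returns the resulting state of B.
    static State run(boolean updateAFirst) {
        State a = State.RUNNING_EXECUTION;
        if (updateAFirst) {
            a = State.STOP;                      // STEP 2 happens first
            return handleRecoverSerialWait(a);   // STEP 1 sees the final state of A
        }
        State b = handleRecoverSerialWait(a);    // STEP 1 sees the stale state of A
        a = State.STOP;                          // STEP 2 happens too late; nothing wakes B again
        return b;
    }

    public static void main(String[] args) {
        System.out.println("STEP 2 before STEP 1: B = " + run(true));
        System.out.println("STEP 1 before STEP 2: B = " + run(false));
    }
}
```

In the second ordering B returns to `SERIAL_WAIT` although A ends up stopped, and because the recover command has already been consumed, no later event moves B forward.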

What you expected to happen

Before resuming subsequent instances, a workflow instance needs to ensure that its own state has been fully updated.

How to reproduce

  • Create Workflow A and Workflow B, both scheduled to run every minute, and use Serial Wait execution type.
  • In Workflow A, create a SUB_PROCESS task node that references Workflow B.
  • Online both Workflow A and Workflow B, and online the schedules.
  • Observe the state changes.

Anything else

No response

Version

3.1.1

Are you willing to submit PR?

  • [] Yes I am willing to submit a PR!

Code of Conduct

CloudSen avatar Jul 25 '24 06:07 CloudSen

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

State transition is a process, but the notification mechanism between processes currently relies on instantaneous states. For example, suppose the workflows also have timeout configurations. In that scenario, if a workflow times out, the next serially waiting workflow might get stuck.

@startuml
StateWheelExecuteThread --> WorkflowTimeoutStateEventHandler: send process A PROCESS_TIMEOUT event
WorkflowTimeoutStateEventHandler -> WorkflowExecuteRunnable: processTimeout
WorkflowExecuteRunnable --> WorkflowStateEventHandler: send process A STOP event
WorkflowStateEventHandler -> WorkflowExecuteRunnable: endProcess
WorkflowExecuteRunnable -> WorkflowExecuteRunnable:checkSerialProcess
WorkflowExecuteRunnable --> ProcessServiceImpl: send process B RECOVER_SERIAL_WAIT command
ProcessServiceImpl -> ProcessServiceImpl: STEP 1: handleCommand(if state of process A is RUNNING_PROCESS_STATE, state of process B will change back to SERIAL_WAIT)
WorkflowStateEventHandler -> WorkflowStateEventHandler: STEP 2: update process A to STOP
@enduml

image

Since the order of STEP 1 and STEP 2 cannot be guaranteed, all subsequent instances might get stuck in the "serial wait" state.

image

What you expected to happen

Before resuming subsequent instances, a workflow instance needs to ensure that its own state has been fully updated.

How to reproduce

  • Create Workflow A and Workflow B, both scheduled to run every minute.
  • In Workflow A, create a SUB_PROCESS task node that references Workflow B.
  • Online both Workflow A and Workflow B, and online the schedules.
  • Observe the state changes.

Anything else

No response

Version

dev

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

github-actions[bot] avatar Jul 25 '24 06:07 github-actions[bot]

cc @ruanwenjun

SbloodyS avatar Jul 25 '24 07:07 SbloodyS

Maybe we need a cache queue for the RECOVER_SERIAL_WAIT command? If the command execution finds that the previous workflow has not been updated to a completed state in time, it should be put back in the queue and checked again after a while. When the previous workflow state has been fully updated, this command will successfully consume from the queue and transition from the SERIAL_WAIT state to the SUBMITTED_SUCCESS state.
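A minimal sketch of that retry queue, under stated assumptions: `RecoverSerialWaitRetryQueue`, `submit`, `pollOnce`, and the `isFinished` lookup are hypothetical names invented for illustration, and the in-memory queue stands in for whatever storage the master would actually use. A real implementation would also need a delay between passes and an upper bound on retries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: requeue RECOVER_SERIAL_WAIT until the previous instance is finished.
class RecoverSerialWaitRetryQueue {
    // One pending command: wake `waitingInstanceId` once `previousInstanceId` has finished.
    private static final class Command {
        final int previousInstanceId;
        final int waitingInstanceId;
        Command(int previousInstanceId, int waitingInstanceId) {
            this.previousInstanceId = previousInstanceId;
            this.waitingInstanceId = waitingInstanceId;
        }
    }

    private final Queue<Command> pending = new ArrayDeque<>();
    private final Predicate<Integer> isFinished; // assumed lookup of the persisted instance state

    RecoverSerialWaitRetryQueue(Predicate<Integer> isFinished) {
        this.isFinished = isFinished;
    }

    synchronized void submit(int previousInstanceId, int waitingInstanceId) {
        pending.add(new Command(previousInstanceId, waitingInstanceId));
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    // One polling pass: returns the instances that may leave SERIAL_WAIT now.
    // Commands whose predecessor is not finished yet go back to the queue.
    synchronized List<Integer> pollOnce() {
        List<Integer> promoted = new ArrayList<>();
        int n = pending.size();
        for (int i = 0; i < n; i++) {
            Command c = pending.poll();
            if (isFinished.test(c.previousInstanceId)) {
                promoted.add(c.waitingInstanceId);
            } else {
                pending.add(c); // retry on the next pass
            }
        }
        return promoted;
    }

    public static void main(String[] args) {
        Set<Integer> finished = new HashSet<>();
        RecoverSerialWaitRetryQueue queue = new RecoverSerialWaitRetryQueue(finished::contains);
        queue.submit(1, 2);
        System.out.println(queue.pollOnce()); // instance 1 is not finished yet
        finished.add(1);
        System.out.println(queue.pollOnce()); // instance 1 is finished now
    }
}
```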

CloudSen avatar Jul 26 '24 01:07 CloudSen

+1

ZhaoRuidong avatar Aug 07 '24 03:08 ZhaoRuidong

Currently, I resolved this issue by retrying the RECOVER_SERIAL_WAIT command when the previous workflow was not fully updated, and it works well.

CloudSen avatar Aug 09 '24 05:08 CloudSen

Right now the serial wait implementation is really unstable, and there are many cases in which it does not work well, e.g.

  1. Concurrent triggers can cause multiple workflow instances to run when they should be in serial wait.
  2. A failed notification might leave the origin workflow instance unable to finish.
  3. The workflow has to handle the notification logic itself, which makes the workflow instance state transitions more complex.

It would be better to refactor this and use a global SerialWaitCoordinator to notify the serially waiting workflow instances, so that the origin workflow instance does not need to care whether it has to send a notification.
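To make the idea concrete, here is a rough sketch of what such a coordinator could look like. Everything in it is an assumption for illustration: no `SerialWaitCoordinator` exists in the code base discussed here, and the class name `SerialWaitCoordinatorSketch`, the `enqueue` and `coordinateOnce` methods, and the `isFinished` lookup are hypothetical. It keeps one FIFO queue per workflow definition and promotes the next waiting instance only after it has observed that the current one is finished, so instances never notify each other.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of a global coordinator; not an existing DolphinScheduler class.
class SerialWaitCoordinatorSketch {
    private final Map<Long, Deque<Integer>> waitingByDefinition = new LinkedHashMap<>();
    private final Map<Long, Integer> runningByDefinition = new HashMap<>();
    private final Predicate<Integer> isFinished; // assumed lookup of the persisted instance state

    SerialWaitCoordinatorSketch(Predicate<Integer> isFinished) {
        this.isFinished = isFinished;
    }

    // Every new serial-wait instance is queued; it never decides by itself whether it may run.
    synchronized void enqueue(long definitionCode, int instanceId) {
        waitingByDefinition.computeIfAbsent(definitionCode, k -> new ArrayDeque<>()).add(instanceId);
    }

    // Called periodically; returns the instances that should be started in this round.
    synchronized List<Integer> coordinateOnce() {
        List<Integer> toStart = new ArrayList<>();
        for (Map.Entry<Long, Deque<Integer>> entry : waitingByDefinition.entrySet()) {
            long definitionCode = entry.getKey();
            Integer running = runningByDefinition.get(definitionCode);
            if (running != null && !isFinished.test(running)) {
                continue; // the slot is still occupied
            }
            Integer next = entry.getValue().poll();
            if (next == null) {
                runningByDefinition.remove(definitionCode); // nothing is waiting, free the slot
            } else {
                runningByDefinition.put(definitionCode, next);
                toStart.add(next);
            }
        }
        return toStart;
    }
}
```

Because the coordinator polls the finished state rather than reacting to a one-shot event, a late state update only delays the next instance by one round.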

ruanwenjun avatar Aug 18 '24 14:08 ruanwenjun

I agree. I have also thought about a coordinator similar to SerialWaitCoordinator, instead of having workflow instances do the notification themselves.

  • Is there a refactoring plan for this?
  • How should this bug in 3.1.x be fixed?

Retrying the RECOVER_SERIAL_WAIT command can only resolve instances where the next workflow state is already SERIAL_WAIT. However, it does not address cases where the next workflow instance is transitioning into a SERIAL_WAIT state.

CloudSen avatar Aug 19 '24 05:08 CloudSen

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Sep 20 '24 00:09 github-actions[bot]

In version 3.2.1, I fixed a similar issue, which might be helpful for the current issue. You can refer to it. @CloudSen

#15270

Gallardot avatar Sep 20 '24 13:09 Gallardot

@Gallardot Yes, I’ve already applied patch #15270. But as I mentioned earlier, the state change is always a process. Opening a new transaction only speeds up the state update, but the same issue still exists in concurrent scenarios. A synchronizer is still needed for serially running workflows.

CloudSen avatar Sep 22 '24 08:09 CloudSen

@Gallardot Here’s another example, for saveSerialProcess: in concurrent scenarios (multiple parent workflows with the same scheduling cycle reference the same sub-workflow), multiple start_workflow commands are consumed by different masters at the same time. Then, in saveSerialProcess, each master simultaneously finds that there are no running instances (the other transactions have not been committed yet), and multiple start events for the workflow are triggered at the same time. This ultimately leads to multiple workflow instances running concurrently, which is incorrect when they are supposed to be processed serially.
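The underlying pattern is a check-then-act race: "is anything running?" and "mark me as running" are two separate steps. A sketch of the alternative, an atomic claim, is shown below. It is only an illustration: `SerialSlotClaimSketch`, `tryClaim`, and `release` are hypothetical names, and the in-memory `ConcurrentHashMap` stands in for shared storage that offers an atomic conditional write, which is what multiple masters would actually need.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: claim the single "running" slot of a workflow definition atomically.
class SerialSlotClaimSketch {
    // Stand-in for shared storage with an atomic "insert if absent" operation.
    private final ConcurrentMap<Long, Integer> runningSlot = new ConcurrentHashMap<>();

    // Returns true if this instance may run now; false means it must enter SERIAL_WAIT.
    boolean tryClaim(long definitionCode, int instanceId) {
        return runningSlot.putIfAbsent(definitionCode, instanceId) == null;
    }

    // Frees the slot only if it is still held by the finishing instance.
    boolean release(long definitionCode, int instanceId) {
        return runningSlot.remove(definitionCode, instanceId);
    }

    public static void main(String[] args) {
        SerialSlotClaimSketch slots = new SerialSlotClaimSketch();
        System.out.println(slots.tryClaim(7L, 1)); // first claimant
        System.out.println(slots.tryClaim(7L, 2)); // second claimant while the slot is held
        slots.release(7L, 1);
        System.out.println(slots.tryClaim(7L, 2)); // retry after the slot was released
    }
}
```

With a single atomic claim there is no window in which two callers can both observe "no running instance", regardless of how the start commands are interleaved.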

CloudSen avatar Sep 22 '24 09:09 CloudSen

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Oct 23 '24 00:10 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar Oct 30 '24 00:10 github-actions[bot]

@CloudSen Do you already have any workaround to solve this issue?

hujian0401 avatar Feb 05 '25 09:02 hujian0401

This issue isn't fixed, so I am reopening it.

ruanwenjun avatar Feb 06 '25 09:02 ruanwenjun

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Mar 09 '25 00:03 github-actions[bot]

This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.

github-actions[bot] avatar Mar 16 '25 00:03 github-actions[bot]