dolphinscheduler
dolphinscheduler copied to clipboard
[Bug] [TaskExecutionRunnable] Sub workflow task always repeat run in master-server failover
Search before asking
- [X] I had searched in the issues and found no similar issues.
What happened
Normally, FailoverCoordinator.getFailoverWorkflowsForMaster() finds all workflows that need failover. When the sub-workflow task's TaskExecutionRunnable.failover() method is called, takeOverTaskFromExecutor() now returns false if the task is a logic task. This results in the creation of a new sub-workflow task instance and the publication of its TaskStartLifecycleEvent, causing the sub-workflow to run again during the failover process.
https://github.com/apache/dolphinscheduler/blob/071994933b05850e7dd5f7bccbf45d867640e244/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/runnable/TaskExecutionRunnable.java#L120-L135
https://github.com/apache/dolphinscheduler/blob/071994933b05850e7dd5f7bccbf45d867640e244/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/engine/task/runnable/TaskExecutionRunnable.java#L160-L164
so i think it would be better to check the sub-workflow instance properly(by dao or server communicate) and take it over, instead of creating a whole new task instance.
What you expected to happen
Take over sub-workflow task if the sub-workflow instance is in good status.
How to reproduce
execute a workflow with sub-workflow, restart the master-server, query the database
Anything else
No response
Version
dev
Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
The sub-workflow task is a special task type. During the fault tolerance period, all task types will generate new task instances, which is a unified operation. So this is not a bug.
The
sub-workflowtask is a special task type. During the fault tolerance period, all task types will generate new task instances, which is a unified operation. So this is not a bug.
agree that too, perhaps we can make some improvement to avoid task repeating execute in future.
agree that too, perhaps we can make some improvement to avoid task repeating execute in future.
Task instances usually run as Linux processes, and the same process id may be occupied by other programs after downtime. So it's difficult to achieve.
agree that too, perhaps we can make some improvement to avoid task repeating execute in future.
Task instances usually run as Linux processes, and the same process id may be occupied by other programs after downtime. So it's difficult to achieve.
yes, but i mean, the sub-workflow is already be takeover in master's failover process, it just invisible and still running in background, then the father-workflow in failover process, will rerun's the sub-workflow again(by the sub-workflow task), maybe the original workflow and the new workflow are both running in time.
@SbloodyS I'm sorry if my previous description was unclear. This issue is mainly about fault tolerance in sub-workflow.
Below is a detailed reproduce.
there is the schedule tree:
MAIN_WORKFLOW
SUB_WORKFLOW
STEP1->STEP2->STEP3->STEP4->STEP5
- execute MAIN_WORKFLOW
- stop and start master-server 2 times :
then the SUB_WORKFLOW executed 3 times, and are all running and be controlled in master:
finally, they are all finished:
i tried to create a branch to fix it : https://github.com/reele/dolphinscheduler/compare/dev-usable-fix-all...reele:dolphinscheduler:fix-takeover-wf?expand=1
and that works well.
@reele Hi, the subworkflow take-over logic is under SubWorkflowLogicTask, once we failover a subworkflow task, will generate a new logic task instance, the new task instance will contains the origin SubWorkflowLogicTaskRuntimeContext and then can track the origin sub workflow instance.
I will test this.
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.
@ruanwenjun In dev branch now, there's a better way to takeover sub-workflow task in failover, i made a branch to implement:
https://github.com/apache/dolphinscheduler/compare/dev...reele:fix-takeover-workflow-in-failover?expand=1
after refactored TaskExecutorClient, takeover is called before initializeTaskExecutionContext, so there should only check workflow instance in takeover, and put the TaskExecutor to taskExecutorRepository in TaskEngine, and it will work for good.
@ruanwenjun In dev branch now, there's a better way to takeover sub-workflow task in failover, i made a branch to implement:
https://github.com/apache/dolphinscheduler/compare/dev...reele:fix-takeover-workflow-in-failover?expand=1
after refactored
TaskExecutorClient, takeover is called before initializeTaskExecutionContext, so there should only check workflow instance in takeover, and put theTaskExecutortotaskExecutorRepositoryinTaskEngine, and it will work for good.
Can we move the logic to SubWorkflowLogicTask ? when we failover a task instance(e.g. sub workflow task instance), we directly generate a new task instance which contains the pre runtime context, once the task instance started, it will check if it is a failover task instance, and if it can take-over.
@ruanwenjun In dev branch now, there's a better way to takeover sub-workflow task in failover, i made a branch to implement: https://github.com/apache/dolphinscheduler/compare/dev...reele:fix-takeover-workflow-in-failover?expand=1 after refactored
TaskExecutorClient, takeover is called before initializeTaskExecutionContext, so there should only check workflow instance in takeover, and put theTaskExecutortotaskExecutorRepositoryinTaskEngine, and it will work for good.
Can we move the logic to SubWorkflowLogicTask ? when we failover a task instance(e.g. sub workflow task instance), we directly generate a new task instance which contains the pre runtime context, once the task instance started, it will check if it is a failover task instance, and if it can take-over. Since different logic task instances might have its own logic to judge whether it can take-over. You can find a todo at SubWorkflowLogicTask#initializeSubWorkflowInstance
Can we move the logic to
SubWorkflowLogicTask? when we failover a task instance(e.g. sub workflow task instance), we directly generate a new task instance which contains the pre runtime context, once the task instance started, it will check if it is a failover task instance, and if it can take-over. Since different logic task instances might have its own logic to judge whether it can take-over. You can find a todo atSubWorkflowLogicTask#initializeSubWorkflowInstance
i'll have a try.
@ruanwenjun In dev branch now, there's a better way to takeover sub-workflow task in failover, i made a branch to implement: https://github.com/apache/dolphinscheduler/compare/dev...reele:fix-takeover-workflow-in-failover?expand=1 after refactored
TaskExecutorClient, takeover is called before initializeTaskExecutionContext, so there should only check workflow instance in takeover, and put theTaskExecutortotaskExecutorRepositoryinTaskEngine, and it will work for good.Can we move the logic to
SubWorkflowLogicTask? when we failover a task instance(e.g. sub workflow task instance), we directly generate a new task instance which contains the pre runtime context, once the task instance started, it will check if it is a failover task instance, and if it can take-over. Since different logic task instances might have its own logic to judge whether it can take-over. You can find a todo atSubWorkflowLogicTask#initializeSubWorkflowInstance
it's indeed a simpler approach - my method impacted architecture design, i tried.
Also, during failover, the master directly starts sub-workflows, which creates a split from SubWorkflowLogicTask's processing flow, i think it would be better to have SubWorkflowLogicTask fully handle triggering fault-tolerant sub-workflows.
@reele Yes, we can also exposed more method on TaskPlugin e.g. isFailoverInstance, doFailover but this will need to refactor the task SPI, so the simple way is deal with this in each plugin's start method.
@ruanwenjun i made a new: https://github.com/apache/dolphinscheduler/compare/dev...reele:dolphinscheduler:improvement-takeover-subworkflow-in-subworkflow-task?expand=1
@ruanwenjun i made a new: https://github.com/apache/dolphinscheduler/compare/dev...reele:dolphinscheduler:improvement-takeover-subworkflow-in-subworkflow-task?expand=1
@reele Great, LGTM, please submit a PR.