dolphinscheduler
[Improvement][Master] Allow Recovery of Failed Tasks in Running Workflows
Search before asking
- [X] I have searched the issues and found no similar feature requirement.
Description
When a workflow in DolphinScheduler is configured to continue after task failures, the "Recovery Failed" feature has a limitation. If a task fails while other tasks in the same workflow are still running, the "Recovery Failed" option is unavailable until the whole workflow has finished in a failed state. The failed task and its downstream tasks therefore cannot be rerun until every remaining task has ended, which delays their completion.
For example, in the attached scenario (see image):
Task B1 in Workflow1 has failed, while other tasks such as A1 (which Workflow2 depends on) are still running. We currently have two options, and both are costly. If we wait for Workflow1 to fail before recovering B1, B1's completion is delayed. If we instead stop Workflow1 immediately and then recover it, A1 is killed, so the dependent Workflow2 fails unnecessarily and has to be recovered as well.
Proposed Feature: We suggest allowing failed tasks to be recovered while their workflow is still running. Tasks like B1 could then be recovered proactively, before the workflow as a whole fails, giving workflows that would otherwise fail a chance to complete successfully.
This enhancement could save time and prevent cascading failures in dependent workflows. It would be particularly useful in scenarios where we can foresee a task's failure leading to the workflow’s eventual failure.
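To make the requested behavior concrete, here is a minimal toy sketch of the intended semantics. It is not DolphinScheduler code: `State`, `Workflow`, and `recover_failed_tasks` are hypothetical names invented for this illustration, and the real master would also need to handle downstream tasks, retries, and persistence.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class State(Enum):
    SUBMITTED = "submitted"
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class Workflow:
    """Hypothetical in-memory model of a workflow instance."""
    name: str
    # Maps task name to its current state.
    tasks: Dict[str, State] = field(default_factory=dict)

    def is_running(self) -> bool:
        # The workflow counts as running while any task is still active.
        return any(s in (State.SUBMITTED, State.RUNNING) for s in self.tasks.values())


def recover_failed_tasks(workflow: Workflow) -> List[str]:
    """Resubmit only the failed tasks and leave every other task untouched."""
    if not workflow.is_running():
        # A finished workflow would use the existing whole-workflow recovery instead.
        return []
    recovered = [name for name, state in workflow.tasks.items() if state is State.FAILURE]
    for name in recovered:
        workflow.tasks[name] = State.SUBMITTED
    return recovered


# Scenario from the description: B1 has failed while A1 is still running.
workflow1 = Workflow("Workflow1", {"A1": State.RUNNING, "B1": State.FAILURE})
recovered = recover_failed_tasks(workflow1)
```

In this sketch only B1 is resubmitted and A1 keeps running, so Workflow2's dependency on A1 is not disturbed.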
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct