[Bug] [Master] Post node of switch task cannot be triggered in special case
Search before asking
- [x] I had searched in the issues and found no similar issues.
What happened
We are working on v3.2.1, and we have a DAG like the one below:
The "switch_channel" node switches to either the "echo_teriri" node or the "echo_teriteri" node, but "echo_teriteri" is also the post node of "echo_teriri". The problem is that "echo_teriteri" cannot be triggered when the switch selects "echo_teriri": "switch_channel" is treated as a parent of "echo_teriteri", and the dependency result is set to failure in WorkflowExecuteRunnable#dependTaskSuccess, even though it is not actually a failure.
We can use something like the following as a workaround:
But maybe it would be better to have it fixed in the code?
Thanks a lot to anyone who can help.
What you expected to happen
Node "echo_teriteri" should be triggered normally.
How to reproduce
Reproduced 100 percent of the time with a DAG like the one below:
Anything else
No response
Version
3.2.x
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
A task is executed only if all upstream edges are reachable. In your DAG, the switch doesn't want echo_teriteri executed.
> A task is executed only if all upstream edges are reachable, in your dag, the switch doesn't want echo_teriteri executed.
You're correct that in this DAG, the switch does not want echo_teriteri executed.
However, the key question here is: Should switch_channel be considered an upstream dependency of echo_teriteri if the condition for the echo_teriteri branch is not met?
If switch_channel should indeed be counted as an upstream dependency, then based on the logic that "A task is executed only if all upstream edges are reachable," the second DAG should not execute echo_teriteri because do_nothing is unreachable. However, this is not the current system behavior.
Let’s break it down:
First DAG:
1. switch_channel wants echo_teriri executed.
2. echo_teriri wants echo_teriteri executed.
3. switch_channel does not want echo_teriteri executed.
- Steps 2 and 3 conflict. Since "A task is executed only if all upstream edges are reachable," echo_teriteri is not executed.
Second DAG:
1. switch_channel wants echo_teriri executed.
2. echo_teriri wants echo_teriteri executed.
3. switch_channel does not want do_nothing executed.
4. do_nothing is not executed, which should block echo_teriteri.
- Steps 2 and 4 conflict. However, contrary to the stated logic, echo_teriteri is still executed.
This inconsistency is problematic.
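To make the inconsistency concrete, here is a minimal toy model of the quoted rule applied literally. It is not DolphinScheduler code: the edge lists are inferred from the discussion above (the DAG screenshots are not reproduced here), and `runs_under_literal_rule` is a hypothetical helper written only for this illustration. It assumes every task that runs also succeeds.

```python
# Toy model only, not DolphinScheduler code. Edges are assumptions inferred
# from the discussion, since the DAG screenshots are not available.
FIRST_DAG = {
    ("switch_channel", "echo_teriri"),
    ("switch_channel", "echo_teriteri"),
    ("echo_teriri", "echo_teriteri"),
}
SECOND_DAG = {
    ("switch_channel", "echo_teriri"),
    ("switch_channel", "do_nothing"),
    ("echo_teriri", "echo_teriteri"),
    ("do_nothing", "echo_teriteri"),
}


def runs_under_literal_rule(edges, switch, selected):
    """Return the tasks that run if a task needs EVERY incoming edge to be taken.

    An edge (u, v) counts as taken when u has run and, if u is the switch,
    v is the branch the switch selected.
    """
    nodes = {n for edge in edges for n in edge}
    parents = {n: {u for u, v in edges if v == n} for n in nodes}
    executed = set()
    changed = True
    while changed:  # repeat until no further task becomes runnable
        changed = False
        for node in sorted(nodes - executed):
            all_edges_taken = all(
                u in executed and (u != switch or node == selected)
                for u in parents[node]
            )
            if all_edges_taken:
                executed.add(node)
                changed = True
    return executed


first = runs_under_literal_rule(FIRST_DAG, "switch_channel", "echo_teriri")
second = runs_under_literal_rule(SECOND_DAG, "switch_channel", "echo_teriri")
print(sorted(first))   # ['echo_teriri', 'switch_channel']
print(sorted(second))  # ['echo_teriri', 'switch_channel']
```

Under this literal reading, echo_teriteri is blocked in both DAGs, which is why the second DAG actually running it looks inconsistent with the stated rule.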
Most importantly:
- If the system is designed to prevent echo_teriteri from executing in the first DAG, the workflow instance should end normally.
- Instead, the instance ends in FAIL status despite no failed tasks being present.
- Restarting the instance leads to a zombie process that keeps running indefinitely and cannot be killed.
In the second DAG, do_nothing is removed from the execution graph after switch_channel has executed, so it does not affect echo_teriteri. As for the problem itself, you can test on dev: the whole logic has been refactored, and I am not sure whether this problem still exists. If it does, we should fix it.
I will fix this in 3.3.0; echo_teriteri should be executed in the first DAG.
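One way to picture the behaviour described in the two comments above is a pruning step after the switch has run. The sketch below is only an assumption about the intended semantics, not the actual 3.3.0 change: it drops the switch's edges to unselected branches, discards any task that is no longer reachable from a root, and then runs every task whose remaining parents have run. The edge lists are inferred from the discussion, and `prune_after_switch` and `executed_tasks` are hypothetical names.

```python
# Toy model only, not the real fix. Edges are assumptions inferred from the thread.
FIRST_DAG = {
    ("switch_channel", "echo_teriri"),
    ("switch_channel", "echo_teriteri"),
    ("echo_teriri", "echo_teriteri"),
}
SECOND_DAG = {
    ("switch_channel", "echo_teriri"),
    ("switch_channel", "do_nothing"),
    ("echo_teriri", "echo_teriteri"),
    ("do_nothing", "echo_teriteri"),
}


def prune_after_switch(edges, switch, selected):
    """Drop unselected switch edges, then keep only tasks still reachable from a root."""
    kept = {(u, v) for u, v in edges if u != switch or v == selected}
    nodes = {n for edge in edges for n in edge}
    roots = nodes - {v for _, v in edges}  # tasks with no parent in the original DAG
    reachable, frontier = set(roots), list(roots)
    while frontier:
        current = frontier.pop()
        for u, v in kept:
            if u == current and v not in reachable:
                reachable.add(v)
                frontier.append(v)
    # Edges that start at a pruned task no longer constrain their targets.
    return reachable, {(u, v) for u, v in kept if u in reachable}


def executed_tasks(nodes, edges):
    """Run every task whose remaining parents have all run (assumes no task fails)."""
    executed = set()
    changed = True
    while changed:
        changed = False
        for node in sorted(nodes - executed):
            if all(u in executed for u, v in edges if v == node):
                executed.add(node)
                changed = True
    return executed


for name, dag in (("first", FIRST_DAG), ("second", SECOND_DAG)):
    nodes, edges = prune_after_switch(dag, "switch_channel", "echo_teriri")
    print(name, sorted(executed_tasks(nodes, edges)))
# first ['echo_teriri', 'echo_teriteri', 'switch_channel']
# second ['echo_teriri', 'echo_teriteri', 'switch_channel']
```

In this model echo_teriteri runs in both DAGs: in the second because do_nothing is pruned, and in the first because only the direct switch edge is dropped while the path through echo_teriri remains.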
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.