[Bug] [Master] Post node of switch task cannot be triggered in special case

Open kchen-shanghai opened this issue 10 months ago • 5 comments

Search before asking

  • [x] I had searched in the issues and found no similar issues.

What happened

We are working on v3.2.1, and we have a DAG like the one below:

Image

The "switch_channel" node switches to either the "echo_teriri" or the "echo_teriteri" node, but "echo_teriteri" is also the post node of "echo_teriri". The problem is that "echo_teriteri" cannot be triggered: "switch_channel" is treated as its parent, and the depend result is set to failure in WorkflowExecuteRunnable#dependTaskSuccess (which it is not).

As a workaround, we can use something like:

Image

But maybe it would be better to fix this in the code? Thanks a lot to anyone who can help.

What you expected to happen

Node "echo_teriteri" should be triggered normally.

How to reproduce

Reproduced 100 percent of the time with a DAG like the one below: Image

Anything else

No response

Version

3.2.x

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

Code of Conduct

kchen-shanghai avatar Mar 08 '25 07:03 kchen-shanghai

A task is executed only if all upstream edges are reachable. In your DAG, the switch doesn't want echo_teriteri executed.

ruanwenjun avatar Mar 13 '25 02:03 ruanwenjun

A task is executed only if all upstream edges are reachable. In your DAG, the switch doesn't want echo_teriteri executed.

You're correct that in this DAG, the switch does not want echo_teriteri executed.

However, the key question here is: Should switch_channel be considered an upstream dependency of echo_teriteri if the condition for the echo_teriteri branch is not met?

If switch_channel should indeed be counted as an upstream dependency, then based on the logic that "A task is executed only if all upstream edges are reachable," the second DAG should not execute echo_teriteri because do_nothing is unreachable. However, this is not the current system behavior.

Let’s break it down:

First DAG:

  1. switch_channel wants echo_teriri executed.
  2. echo_teriri wants echo_teriteri executed.
  3. switch_channel does not want echo_teriteri executed.
  4. Steps 2 and 3 conflict. Since "A task is executed only if all upstream edges are reachable," echo_teriteri is not executed.

Second DAG:

  1. switch_channel wants echo_teriri executed.
  2. echo_teriri wants echo_teriteri executed.
  3. switch_channel does not want do_nothing executed.
  4. do_nothing is not executed, which should block echo_teriteri.
  5. Steps 2 and 4 conflict. However, contrary to the stated logic, echo_teriteri is still executed.

This inconsistency is problematic.
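
The two walkthroughs above can be checked mechanically. What follows is a minimal Java sketch, not DolphinScheduler code: the DAG shapes are inferred from the description (the images are not available here), and `edgeReachable` and `canRun` are hypothetical helpers that apply the quoted rule literally. Under that literal reading, echo_teriteri is blocked in both DAGs, so a rule that blocks it in the first DAG but not in the second is not being applied consistently.

```java
import java.util.List;
import java.util.Map;

public class StrictUpstreamRule {

    // Task -> direct upstream tasks. Both shapes are assumptions based on the issue text.
    static final Map<String, List<String>> FIRST_DAG = Map.of(
            "switch_channel", List.of(),
            "echo_teriri", List.of("switch_channel"),
            "echo_teriteri", List.of("switch_channel", "echo_teriri"));

    static final Map<String, List<String>> SECOND_DAG = Map.of(
            "switch_channel", List.of(),
            "echo_teriri", List.of("switch_channel"),
            "do_nothing", List.of("switch_channel"),
            "echo_teriteri", List.of("echo_teriri", "do_nothing"));

    // An edge is reachable when its source can run and, if the source is the
    // switch, the target is the branch the switch selected.
    static boolean edgeReachable(String from, String to,
                                 Map<String, List<String>> dag, String selected) {
        if (!canRun(from, dag, selected)) {
            return false;
        }
        return !from.equals("switch_channel") || to.equals(selected);
    }

    // Literal reading of the rule: a task runs only if ALL upstream edges are reachable.
    static boolean canRun(String task, Map<String, List<String>> dag, String selected) {
        for (String upstream : dag.get(task)) {
            if (!edgeReachable(upstream, task, dag, selected)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The switch selects echo_teriri in both cases.
        System.out.println(canRun("echo_teriteri", FIRST_DAG, "echo_teriri"));  // prints false
        System.out.println(canRun("echo_teriteri", SECOND_DAG, "echo_teriri")); // prints false
    }
}
```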

Most importantly:

  • If the system is designed to prevent echo_teriteri from executing in the first DAG, the workflow instance should end normally.
  • Instead, the instance ends in FAIL status despite no failed tasks being present.
  • Restarting the instance leads to a zombie process that keeps running indefinitely and cannot be killed.

kchen-shanghai avatar Mar 13 '25 06:03 kchen-shanghai

In the second DAG, do_nothing is removed from the execution graph after switch_channel executes, so it does not affect echo_teriteri. As for the problem itself, you can test on dev, where the whole logic has been refactored. I am not sure whether the problem still exists there; if it does, we should fix it.

ruanwenjun avatar Mar 13 '25 09:03 ruanwenjun

I will fix this in 3.3.0, echo_teriteri should be executed in the first DAG.
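
One way to picture the intended behavior is to prune only the switch's unselected edges and then run whatever is still reachable. The Java sketch below is an illustration under assumptions, not the planned 3.3.0 change: the DAG shapes are inferred from the issue text, and `tasksToRun` is a hypothetical helper. With this edge-level pruning, do_nothing drops out of the second DAG, and echo_teriteri stays in both DAGs because it is still reachable through echo_teriri.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BranchPruning {

    // Task -> direct downstream tasks. Both shapes are assumptions based on the issue text.
    static final Map<String, List<String>> FIRST_DAG = Map.of(
            "switch_channel", List.of("echo_teriri", "echo_teriteri"),
            "echo_teriri", List.of("echo_teriteri"));

    static final Map<String, List<String>> SECOND_DAG = Map.of(
            "switch_channel", List.of("echo_teriri", "do_nothing"),
            "echo_teriri", List.of("echo_teriteri"),
            "do_nothing", List.of("echo_teriteri"));

    // Breadth-first walk from the switch that ignores the switch's edges to
    // unselected branches. Tasks never visited are treated as pruned.
    static Set<String> tasksToRun(Map<String, List<String>> downstream,
                                  String switchTask, String selected) {
        Set<String> reachable = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(switchTask);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (!reachable.add(current)) {
                continue; // already visited
            }
            for (String next : downstream.getOrDefault(current, List.of())) {
                if (current.equals(switchTask) && !next.equals(selected)) {
                    continue; // unselected switch edge: skip the edge, not the node
                }
                queue.add(next);
            }
        }
        return reachable;
    }

    public static void main(String[] args) {
        System.out.println(tasksToRun(FIRST_DAG, "switch_channel", "echo_teriri"));
        System.out.println(tasksToRun(SECOND_DAG, "switch_channel", "echo_teriri"));
    }
}
```

The design choice here is to drop the unselected edge rather than the node it points to, so a task that is both a switch branch and a post node of the selected branch is not lost.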

ruanwenjun avatar Mar 13 '25 09:03 ruanwenjun

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in the next 7 days if no further activity occurs.

github-actions[bot] avatar Apr 14 '25 00:04 github-actions[bot]