argo-workflows icon indicating copy to clipboard operation
argo-workflows copied to clipboard

Conditional using global parameter that changes causes workflow to freeze

Open samath117 opened this issue 4 years ago • 6 comments

Summary

Consider the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: changing-conditional-
spec:
  entrypoint: plan
  templates:
  - name: plan
    steps:
    - - name: set-global-parameter
        template: global-parameter-setting
        arguments:
          parameters:
          - name: setting
            value: true

    - - name: change-global-parameter-conditionally
        template: nested-global-parameter-setting
        when: "{{workflow.outputs.parameters.parameter}}"
        arguments:
          parameters:
          - name: setting
            value: false

  - name: nested-global-parameter-setting
    inputs:
      parameters:
      - name: setting
    steps:
    - - name: set-global-parameter
        template: global-parameter-setting
        arguments:
          parameters:
          - name: setting
            value: "{{inputs.parameters.setting}}"

  - name: global-parameter-setting
    inputs:
      parameters:
      - name: setting
    container:
      image: alpine:3.12
      command: [sh, -c]
      args: ["echo '{{inputs.parameters.setting}}' > /parameter.txt"]
    outputs:
      parameters:
      - name: global-parameter
        valueFrom:
          path: /parameter.txt
        globalName: parameter

This workflow sets a global parameter to true in its first step, which is then used as a condition to run the second nested step. If the second step runs, it changes the value of the global parameter to false.

However, when this workflow runs, it freezes on the second step, despite having completed its substep. Here is the argo get after it froze for about 30 seconds:

Name:                changing-conditional-rsq2f
Namespace:           virtual
ServiceAccount:      default
Status:              Running
Created:             Tue Jan 12 16:03:11 -0500 (41 seconds ago)
Started:             Tue Jan 12 16:03:11 -0500 (41 seconds ago)
Duration:            41 seconds
Progress:            2/2
ResourcesDuration:   3s*(1 cpu),3s*(100Mi memory)
Output Parameters:
  parameter:         false

STEP                                          TEMPLATE                         PODNAME                                DURATION  MESSAGE
 ● changing-conditional-rsq2f                 plan
 ├───✔ set-global-parameter                   global-parameter-setting         changing-conditional-rsq2f-1345312722  2s
 └───● change-global-parameter-conditionally  nested-global-parameter-setting
     └───✔ set-global-parameter               global-parameter-setting         changing-conditional-rsq2f-1327142919  3s

In addition, this is the sort of freezing that prevents argo terminate from changing anything -- the workflows still show up on argo list -n virtual --running.

Now for the variations.

  • If we change the second setting from false to true (so it wouldn't actually change the global parameter), this completes successfully, which suggests that this is related to the conditional no longer being satisfied with an updated reference.
  • If we make the second step not nested, then it also works correctly, so this seems to be a particular problem with the composite template here.
  • If we replace the single-step steps template here with a single-task dag template, the same error happens.
  • If we make the first template nested instead, we encounter a different bug: The workflow now complains that it can't resolve {{workflow.outputs.parameters.parameter}}, as if it's unable to see that the nested template can also set the global parameter. I'd be happy to spin this out as another issue if it isn't related.

Diagnostics

I'm running the just-released Argo v2.12.4. I also backtested and this freezing bug also occurs in v2.12.3, v2.12.2, v2.11.8, v2.10.2, v2.9.5, and v2.8.2. (In some of these older versions, you need to replace the boolean values with strings.)


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

samath117 avatar Jan 12 '21 21:01 samath117

I identified the problem, will work on a fix

simster7 avatar Jan 13 '21 17:01 simster7

The issue here is that when the last pod completes, the workflow operator revisits all the nodes in the workflow in order to actually mark them as complete and finish the workflow. When it arrives at change-global-parameter-conditionally, however, it sees that when: false and no longer dives in as it did previously when the value was still true. Since it no longer dives to child nodes, then those nodes are not marked as Succeeded, and therefore change-global-parameter-conditionally is not marked as Succeeded, keeping the workflow in running state.

To be honest, I'm not sure if this is a bug... from my perspective a step that was initially allowed to execute all of a sudden is no longer allowed, even though it has already completed. This will naturally create undefined behavior.

A couple of questions:

  1. Could you speak a little about the use case of having a global variable in a when condition, and having that variable -- which affects parent nodes -- change during runtime by a child node?
  2. What would be your ideal behavior? The children nodes marked as Succeeded even though now, after the variable was changed, they were not supposed to be run?

simster7 avatar Jan 13 '21 19:01 simster7

Thanks for looking into this! I see how the bug is happening now and there does appear to be some consistent logic being used.

That said, I initially expected the when conditional to be evaluated only once, when the execution of that step is started. This would just require the workflow controller it to copy/freeze its reference to the global variable to its current value. To turn this around a bit, is there any good reason to make the when condition boolean update dynamically? Are there use cases where we would actually want that? If this isn't a bug, do we actually want these workflows to stall out like this?

My use case is that we have a global variable to record the overall state of the workflow, which you can think of as being either a continue or abort state (some templates have cleanup processes in run if the abort state is entered, so this isn't just the same as using argo terminate or argo stop).

Several templates I have include substeps that check if something went wrong and can update the state to abort. I don't want these templates to execute in the first place if the workflow has already been aborted, so I add the conditional to the top-level template.

I'll be using the workaround of moving the conditionals down to the leaves, but it'd be nice (both for the presentation in argo get and keeping the templates simpler) to leave them at the highest applicable level.

samath117 avatar Jan 13 '21 20:01 samath117

That said, I initially expected the when conditional to be evaluated only once, when the execution of that step is started

This is a good point on its own, I'll look into it

simster7 avatar Jan 13 '21 23:01 simster7

Actually, the issue with that approach is that we, on principle, don't modify anything in the spec: of a workflow, and to evaluate once we would need to do so or keep the state in the status:, which I don't think is a good approach.

Another idea is to simply ignore the when: if the node already exists. There was a PR that did this recently for DAGs (https://github.com/argoproj/argo/pull/4683), so it might be worth exploring

simster7 avatar Jan 13 '21 23:01 simster7

I'll take a look

simster7 avatar Jan 13 '21 23:01 simster7