Conditional using global parameter that changes causes workflow to freeze
Summary
Consider the following workflow:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: changing-conditional-
spec:
entrypoint: plan
templates:
- name: plan
steps:
- - name: set-global-parameter
template: global-parameter-setting
arguments:
parameters:
- name: setting
value: true
- - name: change-global-parameter-conditionally
template: nested-global-parameter-setting
when: "{{workflow.outputs.parameters.parameter}}"
arguments:
parameters:
- name: setting
value: false
- name: nested-global-parameter-setting
inputs:
parameters:
- name: setting
steps:
- - name: set-global-parameter
template: global-parameter-setting
arguments:
parameters:
- name: setting
value: "{{inputs.parameters.setting}}"
- name: global-parameter-setting
inputs:
parameters:
- name: setting
container:
image: alpine:3.12
command: [sh, -c]
args: ["echo '{{inputs.parameters.setting}}' > /parameter.txt"]
outputs:
parameters:
- name: global-parameter
valueFrom:
path: /parameter.txt
globalName: parameter
This workflow sets a global parameter to true in its first step, which is then used as a condition to run the second nested step. If the second step runs, it changes the value of the global parameter to false.
However, when this workflow runs, it freezes on the second step, despite having completed its substep. Here is the argo get after it froze for about 30 seconds:
Name: changing-conditional-rsq2f
Namespace: virtual
ServiceAccount: default
Status: Running
Created: Tue Jan 12 16:03:11 -0500 (41 seconds ago)
Started: Tue Jan 12 16:03:11 -0500 (41 seconds ago)
Duration: 41 seconds
Progress: 2/2
ResourcesDuration: 3s*(1 cpu),3s*(100Mi memory)
Output Parameters:
parameter: false
STEP TEMPLATE PODNAME DURATION MESSAGE
● changing-conditional-rsq2f plan
├───✔ set-global-parameter global-parameter-setting changing-conditional-rsq2f-1345312722 2s
└───● change-global-parameter-conditionally nested-global-parameter-setting
└───✔ set-global-parameter global-parameter-setting changing-conditional-rsq2f-1327142919 3s
In addition, this is the sort of freezing that prevents argo terminate from changing anything -- the workflows still show up on argo list -n virtual --running.
Now for the variations.
- If we change the second setting from
falsetotrue(so it wouldn't actually change the global parameter), this completes successfully, which suggests that this is related to the conditional no longer being satisfied with an updated reference. - If we make the second step not nested, then it also works correctly, so this seems to be a particular problem with the composite template here.
- If we replace the single-step
stepstemplate here with a single-taskdagtemplate, the same error happens. - If we make the first template nested instead, we encounter a different bug: The workflow now complains that it can't resolve
{{workflow.outputs.parameters.parameter}}, as if it's unable to see that the nested template can also set the global parameter. I'd be happy to spin this out as another issue if it isn't related.
Diagnostics
I'm running the just-released Argo v2.12.4. I also backtested and this freezing bug also occurs in v2.12.3, v2.12.2, v2.11.8, v2.10.2, v2.9.5, and v2.8.2. (In some of these older versions, you need to replace the boolean values with strings.)
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
I identified the problem, will work on a fix
The issue here is that when the last pod completes, the workflow operator revisits all the nodes in the workflow in order to actually mark them as complete and finish the workflow. When it arrives at change-global-parameter-conditionally, however, it sees that when: false and no longer dives in as it did previously when the value was still true. Since it no longer dives to child nodes, then those nodes are not marked as Succeeded, and therefore change-global-parameter-conditionally is not marked as Succeeded, keeping the workflow in running state.
To be honest, I'm not sure if this is a bug... from my perspective a step that was initially allowed to execute all of a sudden is no longer allowed, even though it has already completed. This will naturally create undefined behavior.
A couple of questions:
- Could you speak a little about the use case of having a global variable in a
whencondition, and having that variable -- which affects parent nodes -- change during runtime by a child node? - What would be your ideal behavior? The children nodes marked as Succeeded even though now, after the variable was changed, they were not supposed to be run?
Thanks for looking into this! I see how the bug is happening now and there does appear to be some consistent logic being used.
That said, I initially expected the when conditional to be evaluated only once, when the execution of that step is started. This would just require the workflow controller it to copy/freeze its reference to the global variable to its current value. To turn this around a bit, is there any good reason to make the when condition boolean update dynamically? Are there use cases where we would actually want that? If this isn't a bug, do we actually want these workflows to stall out like this?
My use case is that we have a global variable to record the overall state of the workflow, which you can think of as being either a continue or abort state (some templates have cleanup processes in run if the abort state is entered, so this isn't just the same as using argo terminate or argo stop).
Several templates I have include substeps that check if something went wrong and can update the state to abort. I don't want these templates to execute in the first place if the workflow has already been aborted, so I add the conditional to the top-level template.
I'll be using the workaround of moving the conditionals down to the leaves, but it'd be nice (both for the presentation in argo get and keeping the templates simpler) to leave them at the highest applicable level.
That said, I initially expected the when conditional to be evaluated only once, when the execution of that step is started
This is a good point on its own, I'll look into it
Actually, the issue with that approach is that we, on principle, don't modify anything in the spec: of a workflow, and to evaluate once we would need to do so or keep the state in the status:, which I don't think is a good approach.
Another idea is to simply ignore the when: if the node already exists. There was a PR that did this recently for DAGs (https://github.com/argoproj/argo/pull/4683), so it might be worth exploring
I'll take a look