argo-workflows
argo-workflows copied to clipboard
`retryStrategy` with `nodeAntiAffinity` is not working as expected.
Summary
What happened/what you expected to happen?
I wanted to use retryStrategy
with nodeAntiAffinity
in order to prevent retrials from running on the same hosts. I was using the following small workflow for testing, but what happens is all retrials were started on the same host (node).
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: random-fail-
spec:
entrypoint: random
templates:
- name: random
retryStrategy:
limit: 10
retryPolicy: "Always"
affinity:
nodeAntiAffinity: {}
script:
image: python:alpine3.6
command: [python]
source: |
import random
import time
random.seed(time.time())
i = random.randint(0, 10)
print(i)
exit(i)`
Not sure if this is an expected behavior and in my case i should use RetryNodeAntiAffinity
which is, as mentioned in the documentation, is a placeholder for future expansion.
What version are you running? Tested with v3.2.9 and v3.3.3
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
#4679
@goock could you look into this pls?
@goock could you look into this pls?
I run the example from the ticket, and from above 30K feet, it looks like the controller stopped adding Affinity
to the pod's spec, or it's removed at some point, I'll start debugging and find the cause.
@alexec I started working on this.
In this function https://github.com/argoproj/argo-workflows/blob/9d66b69f0bca92d7ef0c9aa67e87a2e334797530/workflow/controller/retry_tweak.go#L15 I'm taking the template from the boundary node and then finding the retry node with the same template, which works for more complicated scenarios but does not work for very simple ones from this ticket - the BoundaryID
is empty.
I see four possible solutions to the issue, but I need advice on this:
-
Modify
FindRetryNode
; if theBoundaryID
is empty, take the template from the node and then find the retry node. I would need confirmation ifTemplateName
is not empty in all cases or at least whenBoundaryID
is empty. -
Create another traverse function (like this one https://github.com/argoproj/argo-workflows/blob/master/workflow/util/retry/retry.go#L10), walk over the nodes tree and find the retry node. This solution would be clumsy and inefficient.
-
Add a
Parent
field toNodeStatus struct
and just walk the tree upward to find the retry node. It makes everything super easy, but I'm unsure of possible drawbacks and if we want to introduce another field to theNodeStatus
struct. -
Another way to find the retry node for the current one which I am not aware of.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
Hi @goock @alexec, thank you for looking into the issue so far. I noticed it was automatically marked as stale. Do we have an update on the proposed solutions above please ?
Hi @goock @alexec, thank you for looking into the issue so far. I noticed it was automatically marked as stale. Do we have an update on the proposed solutions above please ?
Hi @cwichka, thanks for replying. We do not have any proposal yet. I do not want to make it too complicated. I would appreciate it if you or someone familiar with the structures could take a look at my comment above and give me some advice and the right direction.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
This is very much needed
@remidebette @cwichka Is it not working for the script template only or all templates (Container)?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
Still needed
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
Hi, still needed. Any news on this?
Hi, still needed. Any news on this?
Hey, I can still work on this but need help as described here https://github.com/argoproj/argo-workflows/issues/9193#issuecomment-1241901976
@alexec would you be able to lend us your thoughts on the below questions @goock proposed above?
I see four possible solutions to the issue, but I need advice on this:
Modify FindRetryNode; if the BoundaryID is empty, take the template from the node and then find the retry node. I would need confirmation if TemplateName is not empty in all cases or at least when BoundaryID is empty.
Create another traverse function (like this one https://github.com/argoproj/argo-workflows/blob/master/workflow/util/retry/retry.go#L10), walk over the nodes tree and find the retry node. This solution would be clumsy and inefficient.
Add a Parent field to NodeStatus struct and just walk the tree upward to find the retry node. It makes everything super easy, but I'm unsure of possible drawbacks and if we want to introduce another field to the NodeStatus struct.
Another way to find the retry node for the current one which I am not aware of.
This is very much needed
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.
Still needed
Please don't add new fields to NodeStatus
as each new fields needsO(n)
extra storage when n
is the number of nodes. It is better to traverse status
and build a new structure if need. A one-time traverse would bo O(n)
.
Thanks for the input @alexec !
@goock - do you still have capacity to work on this issue? We're trying to get a sense so we can prioritize it among other issues. Please do let us know if you have more questions or would like further guidance. Thank you.
@caelan-io I would like to finish this ticket. ATM I'm busy but probably in the next 2-3 weeks I can find a spare time to work on this. The crucial part of this ticket and implementation is to find an efficient way to find the retry node as in my original question https://github.com/argoproj/argo-workflows/issues/9193#issuecomment-1241901976. Thanks.
Thanks for the quick update @goock ! That sounds like a plan. Feel free to post if further questions/ clarification needed.
Bumping this to hopefully get some headway (also, don't want it to be marked stale again. lol)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Any news on this ? 🙄