argo-workflows icon indicating copy to clipboard operation
argo-workflows copied to clipboard

`retryStrategy` with `nodeAntiAffinity` is not working as expected.

Open cwichka opened this issue 2 years ago • 26 comments

Summary

What happened/what you expected to happen?

I wanted to use retryStrategy with nodeAntiAffinity in order to prevent retrials from running on the same hosts. I was using the following small workflow for testing, but what happens is all retrials were started on the same host (node).

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: random-fail-
spec:
  entrypoint: random
  templates:
  - name: random
    retryStrategy:
      limit: 10
      retryPolicy: "Always"
      affinity:
        nodeAntiAffinity: {}
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        import time
        random.seed(time.time())
        i = random.randint(0, 10)
        print(i)
        exit(i)`

Not sure if this is an expected behavior and in my case i should use RetryNodeAntiAffinity which is, as mentioned in the documentation, is a placeholder for future expansion.

What version are you running? Tested with v3.2.9 and v3.3.3


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

cwichka avatar Jul 20 '22 17:07 cwichka

#4679

alexec avatar Aug 02 '22 17:08 alexec

@goock could you look into this pls?

alexec avatar Aug 02 '22 17:08 alexec

@goock could you look into this pls?

I run the example from the ticket, and from above 30K feet, it looks like the controller stopped adding Affinity to the pod's spec, or it's removed at some point, I'll start debugging and find the cause.

goock avatar Aug 03 '22 08:08 goock

@alexec I started working on this. In this function https://github.com/argoproj/argo-workflows/blob/9d66b69f0bca92d7ef0c9aa67e87a2e334797530/workflow/controller/retry_tweak.go#L15 I'm taking the template from the boundary node and then finding the retry node with the same template, which works for more complicated scenarios but does not work for very simple ones from this ticket - the BoundaryID is empty.

I see four possible solutions to the issue, but I need advice on this:

  1. Modify FindRetryNode; if the BoundaryID is empty, take the template from the node and then find the retry node. I would need confirmation if TemplateName is not empty in all cases or at least when BoundaryID is empty.

  2. Create another traverse function (like this one https://github.com/argoproj/argo-workflows/blob/master/workflow/util/retry/retry.go#L10), walk over the nodes tree and find the retry node. This solution would be clumsy and inefficient.

  3. Add a Parent field to NodeStatus struct and just walk the tree upward to find the retry node. It makes everything super easy, but I'm unsure of possible drawbacks and if we want to introduce another field to the NodeStatus struct.

  4. Another way to find the retry node for the current one which I am not aware of.

goock avatar Sep 09 '22 12:09 goock

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Oct 01 '22 06:10 stale[bot]

Hi @goock @alexec, thank you for looking into the issue so far. I noticed it was automatically marked as stale. Do we have an update on the proposed solutions above please ?

cwichka avatar Oct 03 '22 10:10 cwichka

Hi @goock @alexec, thank you for looking into the issue so far. I noticed it was automatically marked as stale. Do we have an update on the proposed solutions above please ?

Hi @cwichka, thanks for replying. We do not have any proposal yet. I do not want to make it too complicated. I would appreciate it if you or someone familiar with the structures could take a look at my comment above and give me some advice and the right direction.

goock avatar Oct 03 '22 11:10 goock

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Oct 22 '22 20:10 stale[bot]

This is very much needed

remidebette avatar Oct 23 '22 17:10 remidebette

@remidebette @cwichka Is it not working for the script template only or all templates (Container)?

sarabala1979 avatar Nov 08 '22 18:11 sarabala1979

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Dec 31 '22 22:12 stale[bot]

Still needed

remidebette avatar Jan 05 '23 09:01 remidebette

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jan 21 '23 20:01 stale[bot]

Hi, still needed. Any news on this?

remidebette avatar Jan 30 '23 11:01 remidebette

Hi, still needed. Any news on this?

Hey, I can still work on this but need help as described here https://github.com/argoproj/argo-workflows/issues/9193#issuecomment-1241901976

goock avatar Feb 01 '23 11:02 goock

@alexec would you be able to lend us your thoughts on the below questions @goock proposed above?

I see four possible solutions to the issue, but I need advice on this:

  1. Modify FindRetryNode; if the BoundaryID is empty, take the template from the node and then find the retry node. I would need confirmation if TemplateName is not empty in all cases or at least when BoundaryID is empty.

  2. Create another traverse function (like this one https://github.com/argoproj/argo-workflows/blob/master/workflow/util/retry/retry.go#L10), walk over the nodes tree and find the retry node. This solution would be clumsy and inefficient.

  3. Add a Parent field to NodeStatus struct and just walk the tree upward to find the retry node. It makes everything super easy, but I'm unsure of possible drawbacks and if we want to introduce another field to the NodeStatus struct.

  4. Another way to find the retry node for the current one which I am not aware of.

caelan-io avatar Mar 17 '23 14:03 caelan-io

This is very much needed

3072141364 avatar May 25 '23 07:05 3072141364

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this is a mentoring request, please provide an update here. Thank you for your contributions.

stale[bot] avatar Jun 18 '23 02:06 stale[bot]

Still needed

kevinnoel-be avatar Jun 20 '23 08:06 kevinnoel-be

Please don't add new fields to NodeStatus as each new fields needsO(n) extra storage when n is the number of nodes. It is better to traverse status and build a new structure if need. A one-time traverse would bo O(n).

alexec avatar Jul 03 '23 17:07 alexec

Thanks for the input @alexec !

@goock - do you still have capacity to work on this issue? We're trying to get a sense so we can prioritize it among other issues. Please do let us know if you have more questions or would like further guidance. Thank you.

caelan-io avatar Jul 04 '23 13:07 caelan-io

@caelan-io I would like to finish this ticket. ATM I'm busy but probably in the next 2-3 weeks I can find a spare time to work on this. The crucial part of this ticket and implementation is to find an efficient way to find the retry node as in my original question https://github.com/argoproj/argo-workflows/issues/9193#issuecomment-1241901976. Thanks.

goock avatar Jul 05 '23 10:07 goock

Thanks for the quick update @goock ! That sounds like a plan. Feel free to post if further questions/ clarification needed.

caelan-io avatar Jul 05 '23 11:07 caelan-io

Bumping this to hopefully get some headway (also, don't want it to be marked stale again. lol)

Interaze avatar Jul 28 '23 18:07 Interaze

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale[bot] avatar Sep 17 '23 03:09 stale[bot]

Any news on this ? 🙄

corentingiraud avatar Oct 09 '23 18:10 corentingiraud