
Expose the "message" of a pod from argo get/watch to subsequent steps


Summary

An Argo variable {{steps.X.message}} (or maybe {{steps.X.outputs.message}}) should include the message information that is displayed in argo get / argo watch so later steps can act accordingly.

Motivation

I work with nodes that sometimes fail for a variety of reasons: failed with exit code 1, pod deleted, the node had condition: [MemoryPressure], and more. I would like to expose this message to later steps for proper processing; for instance, if the message is pod deleted, I would like to run the step again, but I currently have to handle that sort of logic outside the workflow.

Proposal

This information seems accessible, since it's already included in argo get. It seems simple to expose it in the same way as the exitCode.
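To illustrate the intended usage, a later step could branch on that message with a when condition. This is only a sketch of the request: {{steps.run-task.message}} is the variable proposed in this issue (it does not exist today), and the template names are placeholders.

    # Hypothetical usage of the proposed variable; my-task is a placeholder template.
    steps:
    - - name: run-task
        template: my-task
        continueOn:
          failed: true               # let the workflow continue so the next step can inspect the message
    - - name: rerun-if-pod-deleted
        template: my-task
        when: "'{{steps.run-task.message}}' == 'pod deleted'"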


Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

samath117 avatar May 27 '20 16:05 samath117

Can you use retryStrategy to rerun it if the pod is deleted?

sarabala1979 avatar May 27 '20 16:05 sarabala1979

I'm not entirely sure, but I don't think so. I don't see anything in the docs relating the retryStrategy to pod deletion specifically, just to errors or failures.
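For reference, the retryStrategy fields documented today only distinguish broad failure classes, roughly like this (a minimal sketch; it retries on any error, with no way to match a specific message such as pod deleted):

    retryStrategy:
      limit: 2
      retryPolicy: "OnError"   # retries on errors generally; cannot key off a specific message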

samath117 avatar May 27 '20 17:05 samath117

So you don't want to retry on all errors or failures; you want a condition-based retryStrategy based on the message, exit code, or some other variable. Am I understanding your use case correctly?

retryStrategy:
  limit: 2
  retryPolicy: "condition"
  retryCondition: '{{steps.X.message}} == "Pod Deleted"'

retryStrategy:
  limit: 2
  retryPolicy: "condition"
  retryCondition: "{{steps.X.exitcode}} == 2"

sarabala1979 avatar May 27 '20 17:05 sarabala1979

Something like that. I was originally imagining that I'd have to put it together into a recursive loop, a "manual retryStrategy", if you will, since retryCondition doesn't exist yet in the way that you've described it here. Alternatively, I may just want to log that message somewhere for further processing later.
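A rough sketch of that manual approach, assuming the proposed message variable existed (the template names here are hypothetical, and Argo's support for recursive template invocation is what drives the loop):

    # Hypothetical "manual retry": the template invokes itself while the message matches.
    - name: run-with-manual-retry
      steps:
      - - name: attempt
          template: my-task
          continueOn:
            failed: true
      - - name: retry-if-pod-deleted
          template: run-with-manual-retry   # recurse only when the pod was deleted
          when: "'{{steps.attempt.message}}' == 'pod deleted'"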

samath117 avatar May 27 '20 17:05 samath117

Would like to bump this - it would be very useful to know the specific reason a step/task failed. In my use case, if a task fails from memory pressure, I'd like to run the task again but with a different node group that has higher memory capacity. This could be controlled, e.g., with a "when" condition that checks "tasks.myTask.message". A sketch of that idea follows below.
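Sketched out, that might look like the following. Everything here is hypothetical: tasks.myTask.message is the requested variable, and my-task-high-mem stands for a copy of the task template whose nodeSelector/tolerations pin it to a larger node group.

    # Hypothetical DAG: rerun on a bigger node group if the failure message mentions MemoryPressure.
    dag:
      tasks:
      - name: myTask
        template: my-task
      - name: myTask-high-mem
        template: my-task-high-mem            # placeholder template targeting a higher-memory node group
        depends: "myTask.Failed"
        when: "'{{tasks.myTask.message}}' =~ 'MemoryPressure'"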

samuelBB avatar Dec 07 '21 16:12 samuelBB