Add previous resource request to `lastRetry`

Open EladProject opened this issue 1 year ago • 4 comments

Summary

Add the previous run's memory request to `lastRetry`.

Use Cases

I'd like to be able to better control the memory request on retries. Sometimes I don't need to increase it (when the pod is evicted from a spot instance for example). On other occasions (OOMKilled) I'd like to exponentially increase it. Knowing the previous resource request along with https://github.com/argoproj/argo-workflows/pull/12722 should make this possible.


Message from the maintainers:

Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.

EladProject avatar Mar 27 '24 08:03 EladProject

Do you really need the previous resource request? Can't you just use the exit code to know if the last node failed with OOM and only increase the memory in those cases?

eduardodbr avatar Mar 30 '24 20:03 eduardodbr

Are you talking about this: https://github.com/argoproj/argo-workflows/pull/12722 ?

It's still open, no?

Anyway, there are situations where this won't be enough: Let's say that a node was preempted after being retried for OOM. In this case I'd like to retry (for the 3rd time) with the last memory request (which is not the original request, because it was increased after the OOM).

I think that adding the last retry memory together with https://github.com/argoproj/argo-workflows/pull/12722 will do the trick.

EladProject avatar Mar 31 '24 08:03 EladProject

You can calculate it based upon the retry number. There is no need to read the previous value to do the calculations for any mathematical sequence whether it is linear or exponential.
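
To illustrate, here is a minimal Python sketch of that calculation. It is not Argo Workflows code: the function names, the MiB unit, and the default growth factor are all assumptions made for the example. It shows that for a linear or exponential sequence the request follows from the retry number alone:

```python
def exponential_memory_mib(base_mib: int, retry: int, factor: int = 2) -> int:
    """Memory request for a given retry number, growing exponentially.

    retry == 0 is the first attempt, so it gets the base request.
    (Hypothetical helper; names and units are illustrative only.)
    """
    return base_mib * factor ** retry


def linear_memory_mib(base_mib: int, retry: int, step_mib: int) -> int:
    """Memory request for a given retry number, growing by a fixed step."""
    return base_mib + step_mib * retry


# Example: starting from 512 MiB, the third retry of each sequence.
print(exponential_memory_mib(512, 3))        # 512 * 2**3 = 4096
print(linear_memory_mib(512, 3, step_mib=256))  # 512 + 3 * 256 = 1280
```

This works only if every retry should increase the request.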

Joibel avatar Mar 31 '24 09:03 Joibel

The retry number increases regardless of whether the retry was caused by OOM or by something else. So there is no way to calculate the memory of the previous run with certainty (unless there is some way to access the exit codes of all the previous retries).
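
A small Python sketch of the difference, under assumptions: the function names are made up, and the per-attempt list of failure reasons is exactly the information that is not available today. The reason strings are only illustrative labels.

```python
from typing import Sequence


def memory_from_retry_count(base_mib: int, retries: int, factor: int = 2) -> int:
    """Request derived from the retry number only (what is possible today)."""
    return base_mib * factor ** retries


def memory_from_history(
    base_mib: int, failure_reasons: Sequence[str], factor: int = 2
) -> int:
    """Request derived from the failure reason of each previous attempt.

    Only attempts that ended with "OOMKilled" increase the request;
    other failures (for example a spot eviction) leave it unchanged.
    (Hypothetical helper; the history input is an assumption.)
    """
    oom_count = sum(1 for reason in failure_reasons if reason == "OOMKilled")
    return base_mib * factor ** oom_count


# First attempt was OOMKilled, the second one was evicted from a spot node.
history = ["OOMKilled", "Evicted"]
print(memory_from_history(512, history))           # 1024: doubled once
print(memory_from_retry_count(512, len(history)))  # 2048: doubled twice
```

With the retry number alone, the third attempt asks for 2048 MiB even though the previous attempt ran with 1024 MiB and did not fail because of memory.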

EladProject avatar Mar 31 '24 11:03 EladProject