robusta
Restart a job on OOM kill and give it higher memory limits/requests
Is your feature request related to a problem?
When a job fails because of an OOM error, the solution is sometimes to restart it with higher memory limits/requests.
Describe the solution you'd like
Create an action in robusta that restarts a job and increases its memory requests/limits.
Some considerations:
- Something is needed to prevent an infinite loop of the job's pod crashing and getting more memory. For example, the action could take a parameter for the "max" CPU/memory that a restarted job is allowed to reach.
- There should be some way to specify by name which jobs the action applies to. A regex would be best. However, maybe this can be done via the trigger itself.
- A finding should be created notifying about the increase in the job's requests/limits.
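The first two considerations could be sketched roughly as follows. This is only an illustration in plain Python, not Robusta's API: the function names (`parse_memory`, `next_memory_request`, `job_matches`), the 50% default increase, and the supported unit suffixes are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical helper: handles only integer quantities with Ki/Mi/Gi suffixes
# (or plain bytes), which is a small subset of Kubernetes quantity syntax.
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def next_memory_request(
    current: str, max_allowed: str, increase_pct: int = 50
) -> Optional[int]:
    """Return the next memory request in bytes, capped at max_allowed.

    Returns None when the job is already at the cap, which signals
    "do not restart again" and so breaks the crash/increase loop.
    """
    current_bytes = parse_memory(current)
    max_bytes = parse_memory(max_allowed)
    if current_bytes >= max_bytes:
        return None
    proposed = current_bytes + current_bytes * increase_pct // 100
    return min(proposed, max_bytes)


def job_matches(job_name: str, name_pattern: str) -> bool:
    """Check whether the whole job name matches the given regex."""
    return re.fullmatch(name_pattern, job_name) is not None
```

With a cap of `1Gi`, a job at `512Mi` would be restarted with 768Mi, then with 1Gi, and after that the function returns `None` so the action can stop and only report a finding.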
Additional context
- The job failure trigger to be used in conjunction with this action - here
I just wanted to know: will the max parameter take the max memory limit or the max memory request?
I think it should be `request`. Max mem/cpu request.
For CPU, we recommend using only `request`, and for memory we recommend `request`=`limit`.
So I think `request` will be the right parameter.
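To make the recommendation concrete, here is a minimal sketch of the container `resources` section it implies: a CPU request with no CPU limit, and a memory limit equal to the memory request. The helper name `build_resources` is hypothetical and not part of Robusta; only the `requests`/`limits` layout follows the standard Kubernetes container spec.

```python
def build_resources(cpu_request: str, memory_request: str) -> dict:
    """Build a container 'resources' dict following the recommendation above."""
    return {
        # CPU: request only, no limit.
        "requests": {"cpu": cpu_request, "memory": memory_request},
        # Memory: limit is set equal to the request.
        "limits": {"memory": memory_request},
    }


resources = build_resources("500m", "768Mi")
```

Under this scheme a single "max request" parameter is enough, because the memory limit always follows the request.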