
Restart job on OOM kill and give it higher memory limits/requests

Open Sheeproid opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? When a job fails because of an OOM error, sometimes the solution is to restart it with higher memory limits/requests.

Describe the solution you'd like Create an action in robusta that restarts a job and increases its memory requests/limits.

Some considerations:

  • Some safeguard is needed to prevent an infinite loop in which the job's pod keeps crashing and getting more memory. For example, add a parameter to the action for the "max" CPU/memory allowed when restarting the job.
  • There should be some way to specify by name the jobs the filter needs to work on. Maybe using a regex is best. However, maybe it can be done via the trigger itself.
  • A finding should be created notifying about the job's requests/limits increase.
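
The first two considerations could be sketched roughly as below. This is only an illustration in plain Python, not Robusta code: the helper names (`parse_memory`, `next_memory_request`, `job_matches`) and the `factor` and `max_allowed` parameters are hypothetical, and the parser handles only whole-number quantities with a few common Kubernetes suffixes.

```python
import re
from typing import Optional

# Subset of Kubernetes memory suffixes (assumption: no fractional or exotic units).
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, unit = match.groups()
    # A bare number (no suffix) is already in bytes.
    return int(number) * _UNITS.get(unit, 1)


def next_memory_request(current: str, factor: float, max_allowed: str) -> Optional[int]:
    """Return the increased request in bytes, or None once the cap is reached."""
    current_bytes = parse_memory(current)
    cap_bytes = parse_memory(max_allowed)
    if current_bytes >= cap_bytes:
        # Already at the cap: do not restart again, which breaks the loop.
        return None
    return min(int(current_bytes * factor), cap_bytes)


def job_matches(job_name: str, name_pattern: str) -> bool:
    """True if the whole job name matches the (hypothetical) regex parameter."""
    return re.fullmatch(name_pattern, job_name) is not None
```

With these helpers, a job at `512Mi` with a factor of 2 and a cap of `2Gi` would be restarted at 1Gi, then at 2Gi, and then not restarted again.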

Additional context

  • Job failure trigger to be used in conjunction with this action - here

Sheeproid avatar Sep 18 '22 14:09 Sheeproid

I just wanted to know: will the max parameter take the max memory limit or the max memory request?

wahajXgrid avatar Sep 27 '22 11:09 wahajXgrid

I think it should be request: max mem/cpu request. For cpu, we recommend using only request, and for memory we recommend request=limit. So I think request will be the right parameter.

arikalon1 avatar Sep 28 '22 06:09 arikalon1
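
The recommendation in the last comment (memory request=limit, CPU request only) could be expressed as a container `resources` mapping along these lines. This is a hedged sketch: `build_resources` is a hypothetical helper, not part of Robusta, and it returns a plain dict in the shape of the Kubernetes `resources` field.

```python
from typing import Dict, Optional


def build_resources(memory: str, cpu: Optional[str] = None) -> Dict[str, Dict[str, str]]:
    """Build a Kubernetes-style resources mapping for a restarted job."""
    # Memory: request and limit are set to the same value.
    requests = {"memory": memory}
    limits = {"memory": memory}
    if cpu is not None:
        # CPU: only a request is set, deliberately without a limit.
        requests["cpu"] = cpu
    return {"requests": requests, "limits": limits}
```

Under this convention, a single "max" parameter expressed as a request also bounds the memory limit, since the two are always equal.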