
Setup OOMScoreAdj for hub & user services

Open yuvipanda opened this issue 7 years ago • 2 comments

When the system runs out of RAM, we want user processes to be killed in preference to system processes or the hub. This should normally 'just work', but as additional precautions we should default to:

  1. A 90% default memory limit on all spawned user servers
  2. A positive OOMScoreAdjust value on user services

Currently the default limit is 1G, and that should go.

yuvipanda, Jul 16 '18 23:07

I needed to learn a bit more about this in order to review this issue, so I read this post which was nice: https://www.percona.com/blog/2019/08/02/out-of-memory-killer-or-savior/

I conclude that we define two services that are run by systemd, one for jupyterhub and one for traefik. In those definitions, we can set things like OOMScoreAdjust to influence how the out-of-memory (OOM) killer acts. The OOM killer saves the entire system from crashing by choosing a process to kill in order to recover memory.

The gist of this issue is that we want to avoid killing the hub or traefik process, and would prefer that we kill user servers that are using a lot of memory.
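For anyone wanting to check what adjustment a running process actually has, Linux exposes it in `/proc/<pid>/oom_score_adj` (an integer from -1000 to 1000; lower values make the OOM killer less likely to pick that process). Below is a minimal Python sketch for reading it, assuming a Linux host. The helper names and the `proc_root` parameter are hypothetical and only there for illustration; they are not part of TLJH.

```python
from pathlib import Path


def read_oom_score_adj(pid="self", proc_root="/proc"):
    """Return the oom_score_adj value of a process as an int.

    `pid` may be a numeric PID or "self" for the current process.
    `proc_root` is a hypothetical knob so the helper can be pointed
    at a fake directory tree when testing.
    """
    path = Path(proc_root) / str(pid) / "oom_score_adj"
    return int(path.read_text().strip())


def is_protected(pid="self", proc_root="/proc"):
    """True if the process has a negative adjustment, i.e. it is
    less likely than an unadjusted process to be OOM-killed."""
    return read_oom_score_adj(pid, proc_root) < 0


# Example (Linux only): inspect the current process.
# print(read_oom_score_adj())
```

Running this against the PIDs of the hub, traefik and a user server would show whether the hub and traefik really end up with a lower value than user servers.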

Action points

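  1. Figure out what OOMScoreAdjust setting we should have on our hub/traefik service definitions, as found in tljh/systemd-units (a negative value makes a process less likely to be killed; a positive one, as proposed above for user services, makes it more likely)
  2. Configure our Spawner, which is based on SystemdSpawner, to apply the "90% default memory limit"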

@yuvipanda, could you clarify what you mean by "90% default memory limit"? Perhaps you also know whether this can already be configured in SystemdSpawner, or whether it would be reasonable to implement it there?

consideRatio, Oct 22 '21 09:10

Thanks for digging into this, @consideRatio!

By default, I want to make it so that a single user can't use up all the memory on the system, only at most 90% of it, so that someone accidentally creating a trillion-element numpy array doesn't crash everything. This just requires that some memory limit be in place for each user by default. But we don't want to be too restrictive, since we do want users to use the memory that is available; that's why it is there. So my thought process was that we'll limit users by default to a large limit, close to the size of the machine, but allow admins to set tighter limits. I don't know what the current default is, though.

mem_limit is definitely implemented in systemdspawner.
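As a rough sketch of how such a default could be computed (not how TLJH does it today), here is a small Python helper that returns a percentage of physical RAM in bytes. The function name and its parameters are hypothetical. The `os.sysconf` lookup assumes a Linux host, and assigning the result to `mem_limit` in a JupyterHub config file is shown only as a commented assumption.

```python
import os


def default_mem_limit(percent=90, total_bytes=None):
    """Return `percent` of physical RAM, in bytes.

    `total_bytes` can be passed explicitly (handy for testing);
    otherwise it is read from the OS, which assumes Linux.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    if total_bytes is None:
        total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_bytes * percent // 100


# Assumed wiring in a jupyterhub_config.py, where `c` is the config object:
# c.SystemdSpawner.mem_limit = default_mem_limit()
```

Admins who want tighter limits would then simply override the value, which matches the "large default, tighter if configured" idea above.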

I think figuring out an appropriate OOMScoreAdjust for the hub and traefik services would be the next useful step here.

yuvipanda, Oct 22 '21 09:10