[ENH] - Make idle culler settings easily configurable and documented how to change

Open costrouc opened this issue 2 years ago • 5 comments

Feature description

Currently, much of the idle culler configuration is hard-coded. @rsignell-usgs raised this as a concern: the current timeout is too short in some cases.

Value and/or benefit

The default idle timeout does not work for everyone.

Anything else?

No response

costrouc avatar May 14 '22 03:05 costrouc

Hi @costrouc, do they want a per-user config structure, or are they happy to have it set in the qhub-config?

viniciusdc avatar May 16 '22 13:05 viniciusdc

@viniciusdc and @costrouc , we would be happy to set this in the qhub-config.
One of the worst aspects of the timeout being so short is that any terminal sessions disappear. Thanks for taking a look!

rsignell-usgs avatar May 16 '22 18:05 rsignell-usgs

Folks, what would it take to enable this?

This is the number one complaint I've heard from ESIP Qhub users.

Even if it wasn't configurable and just made longer by qhub devs, that would be wonderful. Right now it must be 5 minutes, right?

It would be great if dask clusters spun down in 30 min, and notebooks spun down in 90 min or 3 hours.

Just for comparison, AWS SageMaker Studio Lab, the free notebook offering from AWS, times out after 4 hours for a GPU, 12 hours for a CPU.

rsignell-usgs avatar Jul 27 '22 17:07 rsignell-usgs

Hi @rsignell-usgs, I will make sure this issue is prioritized for our next sprint (which starts next week). I can't promise it will be configurable from the qhub-config.yaml but I will work with the team to come up with a workable solution asap. Thanks again for the reminder!!

iameskild avatar Jul 29 '22 19:07 iameskild

Okay, thanks @iameskild. The users will definitely appreciate any improvement in the situation, even if not configurable!

rsignell-usgs avatar Aug 01 '22 14:08 rsignell-usgs

@iameskild , I remember you showed me how to (temporarily) override the short culler settings by connecting to some pod and editing a config file, right? After the upgrade from 0.4.3 to 0.4.4, the users are screaming again about the too-short timeout for their servers.

rsignell-usgs avatar Oct 22 '22 00:10 rsignell-usgs

Hey @rsignell-usgs, for now, you can manually edit the etc-jupyter configmap if you want to make changes to the timeout settings.

I still have to circle back to this when I have more time, but as a quick update: I was looking into using Terraform's templatefile to make these values more easily configurable.

iameskild avatar Oct 24 '22 15:10 iameskild

This can also be achieved using overrides on the jupyterhub configuration to change the idle-culling variable values. Right now, the values that can be changed are those here

jupyterhub:
  overrides:
    cull:
      users: true

Some values come from the idle-culler extension and, as of now, can only be updated using the above method.

viniciusdc avatar Oct 25 '22 16:10 viniciusdc

To change these, I can use k9s to ssh into the hub-** pod and then just edit them?

rsignell-usgs avatar Oct 25 '22 16:10 rsignell-usgs

@rsignell-usgs yep, just edit the file. You may need to kill the hub pod for the changes to take effect.

iameskild avatar Oct 25 '22 16:10 iameskild

What is the filename once I've ssh'ed into the hub pod?

rsignell-usgs avatar Oct 25 '22 16:10 rsignell-usgs

Here's the workaround recipe that should modify the cull settings (at least until the next qhub/nebari version is deployed):

  • in k9s, type ":configmap"
  • use arrow keys to highlight the etc-jupyter configmap
  • hit the e key to edit (make the changes below), then "esc"
  • still in k9s, type ":pod"
  • use arrow keys to highlight the pod that starts with hub-xx
  • kill the pod (don't worry, it will regenerate in just a few seconds)
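
For anyone without k9s, the same recipe could probably be done with plain kubectl. Below is a rough sketch that only builds the two commands. The namespace `dev` and the pod name `hub-xx` are placeholders (assumptions, not values from this deployment); the real pod name can be looked up with `kubectl get pods` in the right namespace.

```python
import subprocess
from typing import List


def cull_workaround_commands(namespace: str, hub_pod: str) -> List[List[str]]:
    """Build kubectl equivalents of the k9s steps above.

    namespace and hub_pod are supplied by the caller; nothing here is
    discovered automatically.
    """
    return [
        # Opens the etc-jupyter configmap in $EDITOR (the "e" key in k9s).
        ["kubectl", "--namespace", namespace, "edit", "configmap", "etc-jupyter"],
        # Deletes the hub pod so it is recreated and picks up the change.
        ["kubectl", "--namespace", namespace, "delete", "pod", hub_pod],
    ]


def run_workaround(namespace: str, hub_pod: str) -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in cull_workaround_commands(namespace, hub_pod):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder values (hypothetical): replace with your own namespace
    # and the actual hub pod name before calling run_workaround().
    for cmd in cull_workaround_commands("dev", "hub-xx"):
        print(" ".join(cmd))
```

Printing the commands first, rather than running them straight away, makes it easy to check the namespace and pod name before anything is deleted.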

Just for the record, I set everything to 30 minutes:


    # The interval (in seconds) on which to check for terminals exceeding the
    # inactive timeout value.
    c.TerminalManager.cull_interval = 30 * 60

    # cull_idle_timeout: timeout (in seconds) after which an idle kernel is
    # considered ready to be culled
    c.MappingKernelManager.cull_idle_timeout = 30 * 60

    # cull_interval: the interval (in seconds) on which to check for idle
    # kernels exceeding the cull timeout value
    c.MappingKernelManager.cull_interval = 30 * 60

    # cull_connected: whether to consider culling kernels which have one
    # or more connections
    c.MappingKernelManager.cull_connected = True

    # cull_busy: whether to consider culling kernels which are currently
    # busy running some code
    c.MappingKernelManager.cull_busy = False

    # Shut down the server after N seconds with no kernels or terminals
    # running and no activity.
    c.NotebookApp.shutdown_no_activity_timeout = 30 * 60
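
One thing to keep in mind with these numbers: as far as I understand, the kernel culler only looks at kernels once per `cull_interval`, so a kernel that crosses the idle threshold just after a check has to wait for the next one. A small sketch of that arithmetic, assuming this is how the check works (the function name is made up for illustration):

```python
def worst_case_cull_delay(idle_timeout_s: int, check_interval_s: int) -> int:
    """Upper bound, in seconds, between a kernel going idle and being culled.

    Assumes kernels are only inspected once per check interval, so the
    longest wait is the idle timeout plus one full interval.
    """
    return idle_timeout_s + check_interval_s


THIRTY_MIN = 30 * 60  # same value used for both settings above

delay = worst_case_cull_delay(THIRTY_MIN, THIRTY_MIN)
print(f"idle kernel culled after at most ~{delay // 60} minutes")
```

So with both values at 30 minutes, an idle kernel might survive for up to about an hour. A shorter `cull_interval` would bring the actual cull time closer to the 30-minute timeout.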

rsignell-usgs avatar Oct 25 '22 19:10 rsignell-usgs