jupyterhub-idle-culler

Culler not working when there are too many users?

Open · CarlosDominguezBecerril opened this issue 10 months ago · 4 comments

Bug description

The culler is not culling servers even though their last activity is clearly older than the maximum idle time allowed.

Current setup from our k8s logs: Starting service 'cull-idle': ['python3', '-m', 'jupyterhub_idle_culler', '--url=https://<url>:5081/hub/api', '--timeout=14400', '--cull-every=600', '--concurrency=10']

We observe two things:

  • the user is actively using the default server:
"last_activity": "2025-01-23T16:01:47.564284Z",
"started": "2025-01-13T09:57:08.123732Z",
  • the user is not using the named server "analytics":
"last_activity": "2025-01-20T12:06:54.100549Z",
"started": "2025-01-13T13:51:04.802281Z",

In the second case the last_activity was over 3 days ago, which is far beyond the timeout of 14400 seconds (4 hours).
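For context, the culling decision boils down to comparing the time since last_activity against --timeout. A minimal sketch of that check (not the actual culler code, just the logic described above), using the timestamp of the named server:

from datetime import datetime, timedelta, timezone

def parse_iso(ts: str) -> datetime:
    # JupyterHub returns ISO 8601 timestamps with a trailing "Z" (UTC)
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def should_cull(last_activity: str, timeout_seconds: int = 14400) -> bool:
    """True if the server has been idle longer than the timeout."""
    idle_for = datetime.now(timezone.utc) - parse_iso(last_activity)
    return idle_for > timedelta(seconds=timeout_seconds)

# The "analytics" server has been idle since 2025-01-20, so with a
# 4-hour timeout it should clearly have been culled by now.
print(should_cull("2025-01-20T12:06:54.100549Z"))  # True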

Example obtained from /api/users/<user>

{
    "server": "/user/<user>/",
    "created": "2023-02-06T09:37:18.564165Z",
    "admin": false,
    "name": "<user>",
    "kind": "user",
    "last_activity": "2025-01-23T16:06:51.635146Z",
    "roles": [
        "user"
    ],
    "auth_state": {
        "email": "<email>"
    },
    "pending": null,
    "groups": [
        "<groups>"
    ],
    "servers": {
        "": {
            "name": "",
            "full_name": "<user>/",
            "last_activity": "2025-01-23T16:01:47.564284Z",
            "started": "2025-01-13T09:57:08.123732Z",
            "pending": null,
            "ready": true,
            "stopped": false,
            "url": "/user/<user>/",
            "user_options": {
                "profile": "scipy-spark",
                "server_size": "model_training",
                "service_account": "data-science"
            },
            "progress_url": "/hub/api/users/<user>/server/progress",
            "full_url": null,
            "full_progress_url": null,
            "state": {
                "pod_name": "jupyter-<first_name>-2d<last_name>",
                "namespace": "jupyter-hub",
                "dns_name": "jupyter-<first_name>-2d<last_name>.jupyter-hub.svc.cluster.local"
            }
        },
        "analytics": {
            "name": "analytics",
            "full_name": "<user>/analytics",
            "last_activity": "2025-01-20T12:06:54.100549Z",
            "started": "2025-01-13T13:51:04.802281Z",
            "pending": null,
            "ready": true,
            "stopped": false,
            "url": "/user/<user>/analytics/",
            "user_options": {
                "profile": "python-3-analytics",
                "server_size": "default",
                "service_account": "data-science"
            },
            "progress_url": "/hub/api/users/<user>/servers/analytics/progress",
            "full_url": null,
            "full_progress_url": null,
            "state": {
                "pod_name": "jupyter-<first_name>-2d<last_name>--analytics",
                "namespace": "jupyter-hub",
                "dns_name": "jupyter-<first_name>-2d<last_name>--analytics.jupyter-hub.svc.cluster.local"
            }
        }
    }
}
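For reference, the model above can be fetched with something like the following sketch (the <url> and <user> placeholders are the same as above; it assumes a token with permission to read users is exported as JUPYTERHUB_API_TOKEN):

import json
import os
import urllib.request

api_url = "https://<url>:5081/hub/api"  # note the explicit port
token = os.environ["JUPYTERHUB_API_TOKEN"]

req = urllib.request.Request(
    f"{api_url}/users/<user>",
    headers={"Authorization": f"token {token}"},
)
with urllib.request.urlopen(req) as resp:
    model = json.load(resp)

# Print each server's last_activity to compare against the culler timeout
for name, server in model["servers"].items():
    print(name or "<default>", server["last_activity"])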

How to reproduce

I don't know how to reproduce this. We have JupyterHub deployed in several datacenters, but it only fails in our busiest one, which runs over 150 servers simultaneously (not sure if that's related).

I'm afraid the service is blocked / not running as expected because there are too many users.

Expected behaviour

The servers should be culled after they exceed the idle timeout.

Actual behaviour

The servers are not culled after the timeout.

Your personal set up

Latest versions of JupyterHub and jupyterhub-idle-culler.

Logs

I can see the following logs from the culler. ⚠ Not sure if it's important, but the "Fetching page 2" log shows the URL without the port:

Jan 23 16:41:12.885
nbt-hub
[I 250123 15:41:11 __init__:156] Fetching page 2 https://<url without port>/hub/api/users?state=ready&offset=50&limit=50

Jan 23 16:31:12.600
nbt-hub
File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 124, in fetch

Jan 23 16:31:12.600
nbt-hub
File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 142, in fetch_paginated

Jan 23 16:31:12.599
nbt-hub
File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 436, in cull_idle

Sometimes I also get

/usr/lib/python3.10/collections/__init__.py:431: RuntimeWarning: coroutine 'cull_idle.<locals>.handle_user' was never awaited

CarlosDominguezBecerril · Jan 23 '25 17:01

After looking into the issue I found the problem.

During pagination, next_info["url"] doesn't contain the port in the URL:

req.url = next_info["url"]

https://github.com/jupyterhub/jupyterhub-idle-culler/blob/main/jupyterhub_idle_culler/__init__.py#L158

which throws an error during fetching (presumably because without the port the request falls back to the default HTTPS port, where nothing is listening, hence the connection refused):

[I 250123 20:02:44 __init__:156] Fetching page 2 https://<my_domain>/hub/api/users?state=ready&offset=50&limit=50
[E 250123 20:02:44 ioloop:770] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0xf73e1c04f580>>, <Task finished name='Task-1' coro=<cull_idle() done, defined at /usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py:78> exception=ConnectionRefusedError(111, 'Connection refused')>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 750, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 774, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 436, in cull_idle
        async for user in fetch_paginated(req):
      File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 142, in fetch_paginated
        response = await resp_future
      File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 124, in fetch
        return await client.fetch(req)
    ConnectionRefusedError: [Errno 111] Connection refused

A dummy fix for my use case would be:

req.url = next_info["url"].replace(
    "<my_url>",
    "<my_url>:5081"
)
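A less hard-coded workaround would be to keep only the path and query from the pagination link and re-attach the scheme and host:port from the configured --url value. A sketch, assuming the configured API URL is available in a variable (called api_url here):

from urllib.parse import urlsplit, urlunsplit

def fix_next_url(next_url: str, api_url: str) -> str:
    """Reuse scheme and host:port from --url, keep path/query from the pagination link."""
    base = urlsplit(api_url)   # e.g. https://<my_url>:5081/hub/api
    nxt = urlsplit(next_url)   # e.g. https://<my_url>/hub/api/users?state=ready&offset=50&limit=50
    return urlunsplit((base.scheme, base.netloc, nxt.path, nxt.query, nxt.fragment))

# req.url = fix_next_url(next_info["url"], api_url)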

Not ideal, but I was wondering: is this something JupyterHub should return properly, or is the problem in the culler?

I believe it could be JupyterHub, since if we have the hub on

localhost:8000

the first call would look like

localhost:8000/hub/api/users?state=ready&offset=0&limit=50

and the pagination would return as the next URL

localhost/hub/api/users?state=ready&offset=50&limit=50

i.e. without the port, which is incorrect.
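A quick way to check on which side the port is lost is to call the paginated users endpoint directly and look at the next link the Hub returns. A sketch, assuming a Hub on localhost:8000 and a token in JUPYTERHUB_API_TOKEN (the Accept header asks for the paginated response shape the culler consumes):

import json
import os
import urllib.request

api_url = "http://localhost:8000/hub/api"
token = os.environ["JUPYTERHUB_API_TOKEN"]

req = urllib.request.Request(
    f"{api_url}/users?state=ready&offset=0&limit=50",
    headers={
        "Authorization": f"token {token}",
        "Accept": "application/jupyterhub-pagination+json",
    },
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

next_info = (body.get("_pagination") or {}).get("next")
print(next_info["url"] if next_info else "no next page")  # does it include :8000?

If the port is already missing here, the bug would be on the JupyterHub side; if it is present, it is being dropped in the culler.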

CarlosDominguezBecerril · Jan 23 '25 19:01

Please can you tell us your Z2JH version, and share your full configuration?

manics · Feb 04 '25 15:02

We are using Z2JH version 2.0.0. As for our configuration:

For our culler config:

# cull relates to the jupyterhub-idle-culler service, responsible for evicting
# inactive singleuser pods.
#
# The configuration below, except for enabled, corresponds to command-line flags
# for jupyterhub-idle-culler as documented here:
# https://github.com/jupyterhub/jupyterhub-idle-culler#as-a-standalone-script
#
cull:
  enabled: true
  users: false # --cull-users
  removeNamedServers: false # --remove-named-servers
  timeout: 14400 # --timeout (in seconds) -> server will shut down after 4 hours of inactivity
  every: 600 # --cull-every (in seconds) -> checks for expired servers every 10 minutes
  concurrency: 10 # --concurrency
  maxAge: 0 # --max-age

Installed versions that I believe matter:

"jupyterhub==5.2.0",
"jupyterhub-traefik-proxy==2.0.0",
"jupyterhub-idle-culler==1.4.0",
"jupyterhub-announcement",
"kubernetes==28.1.0",

CarlosDominguezBecerril · Feb 14 '25 09:02

Can confirm we are seeing this too. Activity is not updating, and the idle-culler appears not to be culling.

[D 2025-07-04 21:40:08.766 JupyterHub scopes:1010] Checking access to /hub/api/users/johndoe/activity via scope users:activity!user=johndoe
[D 2025-07-04 21:40:08.771 JupyterHub users:964] Not updating activity for <User(johndoe1/1 running)>: 2025-07-04T21:30:06.405342Z < 2025-07-04T21:30:06.405342Z
[D 2025-07-04 21:40:08.772 JupyterHub users:985] Not updating server activity on johndoe/: 2025-07-04T21:30:06.405342Z < 2025-07-04T21:30:06.405342Z

distributedlock · Jul 04 '25 21:07