
Having trouble similar to #209, pod-reflector restarting after long hang

Open alex-orange opened this issue 2 years ago • 13 comments

Bug description

Expected behaviour

Pod should start quickly.

Actual behaviour

Pod doesn't start; the request takes 5 minutes to time out, and then the pod starts quickly on a second attempt. The pod-reflector seems to be getting restarted.

How to reproduce

  1. Using binderhub.
  2. Start a binder.
  3. Container either already exists or is built fine.
  4. Container starts; the pod exists in the cluster, but the hub can't seem to find it.
  5. Wait 5 minutes; the spawn times out and complains about the pod-reflector, among other things.
  6. Pod starts in seconds.

Your personal set up

  • Binderhub on a bare metal cluster.
  • Not sure what details are relevant; ask away, I'm happy to give anything that isn't in a k8s secret.
  • Helm BinderHub chart version binderhub-0.2.0-n870.h3d61c7a. I believe this to be a JupyterHub problem, since it seems to be failing in the hub pod, but I don't quite know how to get the associated JupyterHub version.
  • Logs

[W 2022-02-23 21:44:16.404 JupyterHub spawner:2861] Ignoring unrecognized KubeSpawner user_options: binder_launch_host, binder_persistent_request, binder_ref_url, binder_request, image, repo_url, token
[I 2022-02-23 21:44:16.417 JupyterHub log:189] 202 POST /hub/api/users/alex_orange-link-budget-binder-8iaji54m/servers/ ([email protected]) 104.34ms
[I 2022-02-23 21:44:16.417 JupyterHub spawner:2302] Attempting to create pod jupyter-alex-5forange-2dlink-2dbudget-2dbinder-2d8iaji54m, with timeout 3
[E 2022-02-23 21:44:34.148 JupyterHub spawner:2581] Pod binderhub/jupyter-alex-5forange-2dlink-2dbudget-2dbinder-2dajne11h7 never showed up in reflector, restarting pod reflector
(Some really long message, too much for me to parse if secrets in it or not)
[I 2022-02-23 21:44:34.168 JupyterHub reflector:275] watching for pods with label selector='component=singleuser-server' in namespace binderhub
[E 2022-02-23 21:44:34.171 JupyterHub gen:623] Exception in Future <Task finished name='Task-667053' coro=<KubeSpawner._start() done, defined at /usr/local/lib/python3.8/dist-packages/kubespawner/spawner.py:2466> exception=TimeoutError('pod binderhub/jupyter-alex-5forange-2dlink-2dbudget-2dbinder-2dajne11h7 did not start in 300 seconds!')> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 618, in error_callback
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/kubespawner/spawner.py", line 2572, in _start
        await exponential_backoff(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: pod binderhub/jupyter-alex-5forange-2dlink-2dbudget-2dbinder-2dajne11h7 did not start in 300 seconds!

[W 2022-02-23 21:44:38.789 JupyterHub _version:41] Single-user server has no version header, which means it is likely < 0.8. Expected 1.4.2
[I 2022-02-23 21:44:38.789 JupyterHub base:909] User alex_orange-link-budget-binder-8iaji54m took 22.450 seconds to start
[I 2022-02-23 21:44:38.789 JupyterHub proxy:285] Adding user alex_orange-link-budget-binder-8iaji54m to proxy /user/alex_orange-link-budget-binder-8iaji54m/ => http://x.x.x.x:8888
[I 2022-02-23 21:44:38.792 JupyterHub users:677] Server alex_orange-link-budget-binder-8iaji54m is ready
[I 2022-02-23 21:44:38.793 JupyterHub log:189] 200 GET /hub/api/users/alex_orange-link-budget-binder-8iaji54m/server/progress ([email protected]) 22355.43ms

alex-orange avatar Feb 23 '22 22:02 alex-orange

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

welcome[bot] avatar Feb 23 '22 22:02 welcome[bot]

P.S. This seems related to #209 at a high level (something related to the pod-reflector). I did check, and at least some of the changes from that fix are in the code running on the hub pod. I assume the bits that aren't there are just so out of date that they don't exist anymore.

alex-orange avatar Feb 23 '22 22:02 alex-orange

@alex-orange I'm sorry not to be able to help you, but this is too far from an issue that can be connected to a fixable fault in KubeSpawner. I suggest raising this as a question about getting BinderHub functional on discourse.jupyter.org.

Note that it is very relevant to provide the configuration you have passed to the BinderHub Helm chart if you want help debugging your setup. Also, general k8s advice is to run kubectl describe pod on a pod that fails to start and then look at the associated Events at the end of the description you get; those events clarify why a created pod is having trouble starting.
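
For illustration, roughly the same Events can be pulled with the kubernetes Python client (an untested sketch; the namespace and pod name below are placeholders for your stuck pod):

# Untested sketch: list the k8s Events attached to a pod, the same information
# shown in the Events section at the end of `kubectl describe pod`.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the hub pod
v1 = client.CoreV1Api()
events = v1.list_namespaced_event(
    namespace="binderhub",                              # placeholder namespace
    field_selector="involvedObject.name=jupyter-xyz",   # placeholder pod name
)
for ev in events.items:
    print(ev.last_timestamp, ev.type, ev.reason, ev.message)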

consideRatio avatar Mar 10 '22 14:03 consideRatio

Please reopen; this is NOT a support request, this IS a bug.

I think there is some confusion: this is NOT a problem of the pod failing to start; it IS a problem of the spawner not being aware that the pod has started and that it should connect the user to it. I'll attach my Helm setup in just a sec. Please note this line of the log:

[E 2022-02-23 21:44:34.148 JupyterHub spawner:2581] Pod binderhub/jupyter-alex-5forange-2dlink-2dbudget-2dbinder-2dajne11h7 never showed up in reflector, restarting pod reflector

My parsing of that error message is that the reflector is falling over for some reason, that this is why the spawner can't see the pod (the pod is running, just not recognized by the spawner), and that once the reflector restarts everything goes back to working properly.
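
To spell out the pattern I'm reading out of the traceback (a paraphrase based only on the stack trace above, not the actual kubespawner source): _start() waits for the pod to show up in the reflector's local cache via exponential_backoff, so if the watch feeding that cache has silently died, the pod never appears there and the 300 second timeout fires even though Kubernetes started the pod fine. Roughly:

# Paraphrase of the failing wait; pod_cache is a stand-in for the reflector's
# internal cache, not a real kubespawner attribute.
from jupyterhub.utils import exponential_backoff

async def wait_for_pod_in_cache(pod_name, pod_cache, timeout=300):
    await exponential_backoff(
        lambda: pod_name in pod_cache,  # only becomes true if the watch delivers the pod
        f"pod {pod_name} did not start in {timeout} seconds!",
        timeout=timeout,
    )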

Please note this will be difficult to debug, as it seems to be triggered by some timeout (though I don't know how long it takes to break).

alex-orange avatar Mar 10 '22 16:03 alex-orange

Here are the helm commands that I run; I've attached the YAML files I use:

helm repo add jupyterhub https://jupyterhub.github.io/helm-chart
helm repo update
helm install binderhub jupyterhub/binderhub --devel --version=0.2.0-n870.h3d61c7a --namespace=binderhub -f secret.yaml -f config.yaml

The following are actually YAML files; GitHub simply won't allow .yaml attachments. secret-ceph.yaml is just a k8s secret for the Ceph credentials. config.txt secret.txt secret-ceph.txt

alex-orange avatar Mar 10 '22 16:03 alex-orange

Hi @alex-orange, I've re-opened the issue for now. There wasn't much information in your original post, and it sounded like a support question. If you reckon the problem is in KubeSpawner, could you please try to reproduce it on a simpler system, e.g. with a minimal Z2JH, or even by running KubeSpawner locally (with a local kubeconfig)? As @consideRatio mentioned, it's very difficult for us to investigate as it sounds specific to your setup, so we'll need to rely on you to do most of the investigation.

manics avatar Mar 10 '22 16:03 manics

Gladly. Could someone point me to what you mean by running KubeSpawner locally? I'm not sure where to look for details on that. Is there maybe some way I can run some Python code to access this "reflector" bit and directly diagnose it? I suspect what's going on is that it's losing its connection to the cluster and not realizing it.

As a clarification, I am only familiar with running this as part of BinderHub; is KubeSpawner also used as part of a JupyterHub Kubernetes setup?

alex-orange avatar Mar 10 '22 18:03 alex-orange

I reacted to seeing the following messages in the logs:

[W 2022-02-23 21:44:16.404 JupyterHub spawner:2861] Ignoring unrecognized KubeSpawner user_options: binder_launch_host, binder_persistent_request, binder_ref_url, binder_request, image, repo_url, token

That seemed very wrong, and if it was wrong I assumed misconfiguration, which would make this a support question in the sense that instead of being an issue with KubeSpawner, it was an issue with how BinderHub was configured, which influenced how KubeSpawner was configured, which made KubeSpawner fail for some reason unrelated to KubeSpawner itself.

Here are logs from the hub k8s Pod, where JupyterHub and KubeSpawner run, in a healthy situation. I got them from the "Kubernetes namespace report" step of BinderHub's tests in the GitHub Workflow job named test (latest, helm) (https://github.com/jupyterhub/binderhub/runs/5481750759?check_suite_focus=true#step:28:321). There I saw no warnings about weird configuration.

[D 2022-03-09 14:42:33.321 JupyterHub application:730] Looking for /usr/local/etc/jupyterhub/jupyterhub_config in /srv/jupyterhub
Loading /usr/local/etc/jupyterhub/secret/values.yaml
No config at /usr/local/etc/jupyterhub/existing-secret/values.yaml
Loading extra config: 0-binderspawnermixin
Loading extra config: 00-binder
[D 2022-03-09 14:42:33.818 JupyterHub application:752] Loaded config file: /usr/local/etc/jupyterhub/jupyterhub_config.py
[I 2022-03-09 14:42:33.826 JupyterHub app:2459] Running JupyterHub version 1.4.2
[I 2022-03-09 14:42:33.826 JupyterHub app:2489] Using Authenticator: nullauthenticator.NullAuthenticator-1.0.0
[I 2022-03-09 14:42:33.826 JupyterHub app:2489] Using Spawner: builtins.BinderSpawner
[I 2022-03-09 14:42:33.826 JupyterHub app:2489] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-1.4.2
[D 2022-03-09 14:42:33.827 JupyterHub app:1701] Connecting to db: sqlite://
[D 2022-03-09 14:42:33.840 JupyterHub orm:784] Stamping empty database with alembic revision 4dc2d5a8c53c
[I 2022-03-09 14:42:33.843 alembic.runtime.migration migration:164] Context impl SQLiteImpl.
[I 2022-03-09 14:42:33.843 alembic.runtime.migration migration:167] Will assume non-transactional DDL.
[I 2022-03-09 14:42:33.847 alembic.runtime.migration migration:556] Running stamp_revision  -> 4dc2d5a8c53c
[D 2022-03-09 14:42:33.847 alembic.runtime.migration migration:752] new branch insert 4dc2d5a8c53c
[W 2022-03-09 14:42:33.855 JupyterHub app:1808] No admin users, admin interface will be unavailable.
[W 2022-03-09 14:42:33.855 JupyterHub app:1809] Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2022-03-09 14:42:33.855 JupyterHub app:1838] Not using allowed_users. Any authenticated user will be allowed.
[I 2022-03-09 14:42:33.919 JupyterHub app:1961] Adding API token for service: binder
[D 2022-03-09 14:42:33.925 JupyterHub app:1990] Purging expired APITokens
[D 2022-03-09 14:42:33.926 JupyterHub app:1990] Purging expired OAuthAccessTokens
[D 2022-03-09 14:42:33.929 JupyterHub app:1990] Purging expired OAuthCodes
[D 2022-03-09 14:42:33.936 JupyterHub app:2113] Initializing spawners
[D 2022-03-09 14:42:33.938 JupyterHub app:2246] Loaded users:
    
[I 2022-03-09 14:42:33.938 JupyterHub app:2526] Initialized 0 spawners in 0.002 seconds
[I 2022-03-09 14:42:33.940 JupyterHub app:2738] Not starting proxy
[D 2022-03-09 14:42:33.940 JupyterHub proxy:832] Proxy: Fetching GET http://proxy-api:8001/api/routes
[D 2022-03-09 14:42:33.947 JupyterHub proxy:905] Omitting non-jupyterhub route '/'
[I 2022-03-09 14:42:33.948 JupyterHub app:2774] Hub API listening on http://:8081/hub/
[I 2022-03-09 14:42:33.948 JupyterHub app:2776] Private Hub API connect url http://hub:8081/hub/
[I 2022-03-09 14:42:33.949 JupyterHub app:2789] Starting managed service cull-idle
[I 2022-03-09 14:42:33.949 JupyterHub service:339] Starting service 'cull-idle': ['python3', '-m', 'jupyterhub_idle_culler', '--url=http://localhost:8081/hub/api', '--timeout=3600', '--cull-every=600', '--concurrency=10', '--cull-users']
[I 2022-03-09 14:42:33.951 JupyterHub service:121] Spawning python3 -m jupyterhub_idle_culler --url=http://localhost:8081/hub/api --timeout=3600 --cull-every=600 --concurrency=10 --cull-users
[D 2022-03-09 14:42:33.961 JupyterHub spawner:1179] Polling subprocess every 30s
[I 2022-03-09 14:42:33.961 JupyterHub app:2798] Adding external service binder
[D 2022-03-09 14:42:33.961 JupyterHub proxy:342] Fetching routes to check
[D 2022-03-09 14:42:33.962 JupyterHub proxy:832] Proxy: Fetching GET http://proxy-api:8001/api/routes
[D 2022-03-09 14:42:33.966 JupyterHub proxy:905] Omitting non-jupyterhub route '/'
[I 2022-03-09 14:42:33.966 JupyterHub proxy:347] Checking routes
[I 2022-03-09 14:42:33.966 JupyterHub proxy:432] Adding route for Hub: / => http://hub:8081
[D 2022-03-09 14:42:33.967 JupyterHub proxy:832] Proxy: Fetching POST http://proxy-api:8001/api/routes/
[I 2022-03-09 14:42:33.973 JupyterHub app:2849] JupyterHub is now running at http://:8000
[D 2022-03-09 14:42:33.974 JupyterHub app:2452] It took 0.658 seconds for the Hub to start

I still see this as a support question until it can be clearly pinpointed that this is a bug in KubeSpawner rather than a misconfiguration of the BinderHub Helm chart or similar. It is too vaguely coupled to KubeSpawner. It may be a bug in KubeSpawner, but we are not enough people to handle the workload of maintaining kubespawner and other JupyterHub organization projects if we are to investigate things that may conclude with "oh, it was due to <something unrelated>".

I hope you can understand that open source software can't provide the in-depth service that a paid project can. That is why I'm trying to manage the time I and other maintainers have by closing this issue after evaluating it as a likely configuration issue, even though I haven't ruled out that it could be a bug, which I really wish I could.

@alex-orange you raised questions that I sadly don't have time to answer in more depth, even though they are relevant steps toward concluding whether this is actually a bug rather than a configuration mistake or similar.

consideRatio avatar Mar 10 '22 19:03 consideRatio

Ah, I didn't realize I should be looking at that line. Let me throw together a minimal binderhub with the least amount of extra stuff I can manage and see if it still puts that out.

I do understand about the lack of people to handle stuff. I just feel this is more like a bug than a misconfiguration, as I'm running a fairly minimal setup. It also seems to trigger after a lot of inactivity, which makes me think that something is losing a connection and then not reconnecting properly.

alex-orange avatar Mar 14 '22 20:03 alex-orange

Actually, strike that. @consideRatio, see line 397 of the link that you provided:

[W 2022-03-09 14:43:04.038 JupyterHub spawner:2861] Ignoring unrecognized KubeSpawner user_options: binder_launch_host, binder_persistent_request, binder_ref_url, binder_request, image, repo_url, token

I will continue to attempt to find a repeatable set of operations to test this problem against, but I would say that means it's not a misconfiguration issue (at least with respect to the line you've pointed out from my log). I'm happy to investigate other lines if my log shows strangeness that isn't in the main report.

alex-orange avatar Mar 14 '22 20:03 alex-orange

@alex-orange The KubeSpawner tests are run by setting a kubeconfig and running the tests separately from the K8s cluster: https://github.com/jupyterhub/kubespawner/tree/main/tests

If you can reproduce your problem this way, that gives us something to investigate; otherwise it's most likely a problem with your configuration or your K8s setup.

manics avatar Mar 14 '22 21:03 manics

@manics Thanks, that's helpful. Just to clarify, you mean setting the KUBECONFIG env variable to a kubeconfig that can connect to a running cluster and then running some tests. If I'm guessing right about what my issue is, I would be doing something like: connect to the cluster, sit idle for a few hours, try to spawn a notebook, and then see what happens. I'll see if I can write up a test for that.
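
Something like this rough probe is what I have in mind (an untested sketch using the kubernetes Python client rather than kubespawner itself; the namespace is a placeholder, and the label selector is the one from the hub logs above): open a watch, leave it idle for a few hours, then start a binder and see whether events for the new pod still arrive.

# Untested sketch: watch singleuser pods the way the reflector does, to see
# whether a long-idle watch connection silently stops delivering events.
import time
from kubernetes import client, config, watch

config.load_kube_config()  # honours $KUBECONFIG
v1 = client.CoreV1Api()
w = watch.Watch()

start = time.time()
for event in w.stream(
    v1.list_namespaced_pod,
    namespace="binderhub",                         # placeholder namespace
    label_selector="component=singleuser-server",  # selector seen in the hub logs
):
    pod = event["object"]
    print(f"[{time.time() - start:8.1f}s] {event['type']:8s} "
          f"{pod.metadata.name} phase={pod.status.phase}")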

alex-orange avatar Mar 14 '22 21:03 alex-orange

I'm also experiencing this. Intermittently, user pods are spawned and all status checks return a "pending" state, even when Kubernetes has deemed them started and ready. When the hub attempts to remove these servers that aren't recognized as ready within the timeout period (300 secs), nothing happens; the pods remain active, but the hub has lost its reference to them.

Typically, after restarting the hub pod the single-user servers will be recognized as started, but this must happen before the hub attempts removal.

The only antecedent I've recognized is when requests to start user pods happen in quick succession.

iandesj avatar May 18 '23 14:05 iandesj