zero-to-jupyterhub-k8s
GKE Deployment with gvisor fails with Cloud DNS but not with kube-dns
Bug description
I'm deploying JupyterHub on GKE. I'd like a separation between the default pool and a user pool so that I can use gvisor for container isolation. Currently, I have a Standard cluster running fine with kube-dns, Dataplane V2 turned off, a default pool, and a user pool with gvisor container isolation turned on, using `scheduling.userPods.nodeAffinity.matchNodePurpose: require` to route user servers to the user pool. This cluster runs without any issues.
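For reference, this is roughly the values snippet that handles the routing and the sandbox runtime on that working cluster (a minimal sketch; the full configuration I use is included under "How to reproduce" below):

```yaml
# Minimal sketch of the routing-related values on the working (kube-dns) cluster.
# Assumes the node pools carry the hub.jupyter.org/node-purpose labels described below.
singleuser:
  extraPodConfig:
    runtimeClassName: gvisor      # run user servers under the GKE Sandbox (gvisor) runtime
scheduling:
  userPods:
    nodeAffinity:
      matchNodePurpose: require   # only schedule user pods on nodes labeled node-purpose=user
```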
I've heard that GKE is moving towards Autopilot clusters (which have Dataplane V2 and Cloud DNS enabled), so I decided to create a new Standard cluster with Dataplane V2 and Cloud DNS turned on. This cluster was configured the same way: a default pool and a user pool created with gvisor container isolation. However, in this setup the hub pod is unable to communicate with the user server pod (see errors below).
To isolate the issue, I created a few more clusters with the exact same config (a default pool and a user pool), all with Dataplane V2 on:
- User pool created with gvisor isolation, Cloud DNS, `singleuser.extraPodConfig.runtimeClassName: gvisor`: this is the failing cluster setup described above.
- User pool created with gvisor isolation, Cloud DNS, `singleuser.extraPodConfig.runtimeClassName` not set: this also fails. Since it's the `gvisor` runtime class name that actually turns gvisor on for a pod, the issue occurs even when gvisor is not turned on.
- User pool created with gvisor isolation, kube-dns instead of Cloud DNS: works properly; no issues.
- User pool created without gvisor isolation, Cloud DNS: works properly; no issues.
As a result, it looks like there's an issue with JupyterHub deployed on GKE with the following configuration:
- Standard GKE cluster
- Dataplane V2 enabled
- Cloud DNS enabled instead of kube-dns
- A default pool with the `hub.jupyter.org/node-purpose: core` label. This is to route core and user pods to different nodes.
- A user pool created with gvisor container isolation and the `hub.jupyter.org/node-purpose: user` label.
- Note that gvisor isolation is only applied to a pod when `gvisor` is added as the pod's runtime class name. However, the pod's `runtimeClassName` does not need to be set to `gvisor` for this issue to occur. (A rough sketch of the resulting user pod spec is shown right after this list.)
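To make the scheduling mechanics concrete, here is a rough sketch of the scheduling-related fields that end up on a spawned user pod in this setup. This is not an exact dump of one of my pods: `jupyter-user-example` is a placeholder name, and the affinity is shown for the `require` variant of `matchNodePurpose` (with `prefer` it becomes a preferred rule instead).

```yaml
# Rough sketch of the scheduling-related fields on a spawned user server pod.
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-user-example            # placeholder name
spec:
  # runtimeClassName: gvisor            # only present when singleuser.extraPodConfig.runtimeClassName is set
  affinity:
    nodeAffinity:                       # from scheduling.userPods.nodeAffinity.matchNodePurpose: require
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: hub.jupyter.org/node-purpose
                operator: In
                values: ["user"]
  tolerations:                          # from scheduling.userPods.tolerations in the config below
    - key: hub.jupyter.org/dedicated
      operator: Equal
      value: user
      effect: NoSchedule
    - key: sandbox.gke.io/runtime       # tolerate the taint GKE puts on gvisor node pools
      operator: Equal
      value: gvisor
      effect: NoSchedule
```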
I've also looked at https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/3167, which previously addressed issues with Dataplane V2 and Cloud DNS, but it seems like those changes have already made it into the latest release.
Note that I'm aware gvisor is not officially supported by Z2JH, and it's entirely possible this is an undocumented limitation between gvisor container isolation and Cloud DNS! However, since this issue occurs even when gvisor is turned off, and there doesn't seem to be a wider problem using gvisor with Cloud DNS, there might be something JupyterHub-specific around DNS here. Given that GKE is pushing people towards Autopilot/Cloud DNS clusters, which have these options pre-selected by default (the only change needed to use gvisor in an Autopilot cluster is to add the gvisor runtime class name), I wonder if this issue also prevents people from deploying JupyterHub on an Autopilot cluster today.
(edit: it looks like deploying an Autopilot cluster does not work, whether or not I use `runtimeClassName: gvisor`: the hub and server both time out while waiting for the server to start. That is, unless there's something else that needs to be done beyond what's already specified on Discourse.)
(edit 2: I was mistaken; I'm able to start an Autopilot cluster. Launching the user server just takes a really long time, since user placeholders don't take on the user servers' tolerations/selectors, so we need to wait for a new node to be created and an image to be pulled.)
How to reproduce
- Create a new Standard cluster in GKE. When creating the cluster, select Dataplane V2 and Cloud DNS (instead of kube-dns).
- Edit the default node pool to add the Kubernetes label `hub.jupyter.org/node-purpose: core` to its nodes. This is to make sure core and user pods are scheduled on separate nodes.
- Once the cluster is created, add a node pool that has gvisor sandbox isolation turned on. Also add `hub.jupyter.org/node-purpose: user` as a Kubernetes label on that pool's nodes.
- Deploy JupyterHub to the cluster according to the standard instructions here, using a Helm chart 3.3.7 configuration file with the following settings:
hub:
  baseUrl: "/jupyterhub"
singleuser:
  # extraPodConfig:
  #   # It doesn't matter whether or not this is enabled for reproducing the issue. But in a properly
  #   # working cluster, this needs to be set in order to actually use gvisor isolation.
  #   runtimeClassName: gvisor
  # GKE Sandbox (gvisor) blocks these pods from accessing the Cloud Metadata Server,
  # so we don't need to block it via iptables.
  # https://z2jh.jupyter.org/en/latest/administrator/security.html#audit-cloud-metadata-server-access
  cloudMetadata:
    blockWithIptables: false
scheduling:
  userPods:
    # The gvisor user pool comes by default with the `sandbox.gke.io/runtime: gvisor` taint, so we need to
    # redeclare the default tolerations and add that one.
    # GKE might automatically set this toleration, but we explicitly set it anyway.
    # https://cloud.google.com/kubernetes-engine/docs/how-to/sandbox-pods#regular-pod
    tolerations:
      - effect: NoSchedule
        key: hub.jupyter.org/dedicated
        operator: Equal
        value: user
      - effect: NoSchedule
        key: hub.jupyter.org_dedicated
        operator: Equal
        value: user
      - effect: NoSchedule
        key: sandbox.gke.io/runtime
        operator: Equal
        value: gvisor
    nodeAffinity:
      matchNodePurpose: prefer # or require
cull:
  enabled: true
- Try to start a user server. It should fail with the error messages listed below.
Expected behaviour
The user server should initialize and register itself with the hub.
Working Logs
These are the logs I expect to see if communication between the hub and user server is working:
[I 2024-05-13 22:47:58.993 JupyterHub spawner:2511] Attempting to create pod jupyter-user-2d8, with timeout 3
I0513 22:47:59.017421 1 eventhandlers.go:118] "Add event for unscheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.021208 1 scheduling_queue.go:1070] "About to try and schedule pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.022545 1 schedule_one.go:81] "Attempting to schedule pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.024308 1 default_binder.go:52] "Attempting to bind pod to node" pod="jupyterhub-prod3/jupyter-user-2d8" node="gke-jupyterhub-prod3-user-pool-cloud--6d92de73-bbr4"
I0513 22:47:59.019448 1 eventhandlers.go:118] "Add event for unscheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.036024 1 schedule_one.go:252] "Successfully bound pod to node" pod="jupyterhub-prod3/jupyter-user-2d8" node="gke-jupyterhub-prod3-user-pool-cloud--6d92de73-bbr4" evaluatedNodes=2 feasibleNodes=1
I0513 22:47:59.036857 1 eventhandlers.go:186] "Add event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.036907 1 eventhandlers.go:161] "Delete event for unscheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.041285 1 eventhandlers.go:186] "Add event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.041284 1 eventhandlers.go:161] "Delete event for unscheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.074146 1 eventhandlers.go:206] "Update event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.074142 1 eventhandlers.go:206] "Update event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
[I 2024-05-13 22:47:59.110 JupyterHub log:192] 200 GET /jupyterhub/hub/api/users/user-8 ([email protected]) 7.50ms
I0513 22:47:59.837566 1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.StorageClass total 7 items received
I0513 22:47:59.909642 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="104.548µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.0.1:51028" resp=200
I0513 22:47:59.910888 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="74.645µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.0.1:51048" resp=200
I0513 22:47:59.917849 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="109.625µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.0.1:51024" resp=200
I0513 22:47:59.919774 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="61.369µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.0.1:51020" resp=200
I0513 22:48:00.295593 1 eventhandlers.go:206] "Update event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:48:00.297583 1 eventhandlers.go:206] "Update event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:48:01.085834 1 leaderelection.go:352] lock is held by user-scheduler-ffbb9c68f-jtf5q_dd1ddc82-dfa5-4b14-b457-9c0a922acd16 and has not yet expired
I0513 22:48:01.085857 1 leaderelection.go:253] failed to acquire lease jupyterhub-prod3/user-scheduler-lock
I0513 22:48:01.230979 1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.ReplicationController total 10 items received
[I 2024-05-13 22:48:01.773 JupyterHub log:192] 200 GET /jupyterhub/hub/api (@10.32.2.17) 0.90ms
[I 2024-05-13 22:48:02.004 JupyterHub log:192] 200 POST /jupyterhub/hub/api/users/user-8/activity ([email protected]) 14.48ms
[I 2024-05-13 22:48:03.363 ServerApp] 200 GET /jupyterhub/user/manager-1/api/kernels?1715640483157 ([email protected]) 0.99ms
I0513 22:48:03.839446 1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.PersistentVolumeClaim total 10 items received
I0513 22:48:04.319039 1 leaderelection.go:352] lock is held by user-scheduler-ffbb9c68f-jtf5q_dd1ddc82-dfa5-4b14-b457-9c0a922acd16 and has not yet expired
I0513 22:48:04.319062 1 leaderelection.go:253] failed to acquire lease jupyterhub-prod3/user-scheduler-lock
[I 2024-05-13 22:48:04.460 ServerApp] 200 GET /jupyterhub/user/manager-1/api/kernels?1715640484260 ([email protected]) 1.16ms
[W 2024-05-13 22:48:04.644 JupyterHub _version:38] Single-user server has no version header, which means it is likely < 0.8. Expected 4.1.5
[I 2024-05-13 22:48:04.644 JupyterHub base:1090] User user-8 took 5.686 seconds to start
Actual behaviour
The user server pod is unable to communicate with the hub pod. After a while, the hub times out and the user server shuts down.
Erroring Logs
You can see here that a spawn of the user server is requested, but API requests to the not-yet-running server fail with 424. Once the JupyterHubSingleUser server starts, it fails to connect back to the hub as well (name resolution of `hub` fails), and the hub eventually gives up waiting for the server.
[I 2024-05-13 19:00:05.847 JupyterHub base:1616] User test-service requesting spawn on behalf of manager-1
[W 2024-05-13 19:00:05.848 JupyterHub base:1544] Failing suspected API request to not-running server: /jupyterhub/hub/user/manager-1/api/kernels
[W 2024-05-13 19:00:05.848 JupyterHub log:192] 424 GET /jupyterhub/hub/user/manager-1/api/kernels?1715626805678 ([email protected]) 8.15ms
I0513 19:00:06.507509 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="113.911µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.2.1:54148" resp=200
I0513 19:00:06.511715 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="104.277µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.2.1:54136" resp=200
I0513 19:00:06.660512 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="139.622µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.2.1:36614" resp=200
I0513 19:00:06.663019 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="111.271µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.2.1:36616" resp=200
[I 2024-05-13 19:00:06.667 JupyterHub log:192] 302 GET /jupyterhub/user/user-8/api/kernels?1715626806577 -> /jupyterhub/hub/user/user-8/api/kernels?1715626806577 (@10.138.0.44) 0.88ms
[I 2024-05-13 19:00:06.755 JupyterHub base:1616] User test-service requesting spawn on behalf of user-8
[W 2024-05-13 19:00:06.756 JupyterHub base:1544] Failing suspected API request to not-running server: /jupyterhub/hub/user/user-8/api/kernels
[W 2024-05-13 19:00:06.757 JupyterHub log:192] 424 GET /jupyterhub/hub/user/user-8/api/kernels?1715626806577 ([email protected]) 9.40ms
[E 2024-05-13 18:53:44.773 JupyterHubSingleUser] Failed to connect to my Hub at http://hub:8081/jupyterhub/hub/api (attempt 1/5). Is it running?
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/jupyterhub/singleuser/extension.py", line 336, in check_hub_version
resp = await client.fetch(self.hub_auth.api_url)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting
[E 2024-05-13 18:53:44.790 ServerApp] Exception in callback functools.partial(<function _HTTPConnection.__init__.<locals>.<lambda> at 0x7cb061b69ee0>, <Task finished name='Task-3' coro=<_HTTPConnection.run() done, defined at /opt/conda/lib/python3.11/site-packages/tornado/simple_httpclient.py:290> exception=gaierror(-3, 'Temporary failure in name resolution')>
)
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/tornado/ioloop.py", line 738, in _run_callback
ret = callback()
^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/tornado/simple_httpclient.py", line 287, in <lambda>
gen.convert_yielded(self.run()), lambda f: f.result()
^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/tornado/simple_httpclient.py", line 340, in run
stream = await self.tcp_client.connect(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/tornado/tcpclient.py", line 269, in connect
addrinfo = await self.resolver.resolve(host, port, af)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/tornado/netutil.py", line 433, in resolve
for fam, _, _, _, address in await asyncio.get_running_loop().getaddrinfo(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 867, in getaddrinfo
return await self.run_in_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/socket.py", line 962, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution
[I 2024-05-11 08:10:36.220 JupyterHub base:1616] User test-service requesting spawn on behalf of user-15
[W 2024-05-11 08:10:36.220 JupyterHub base:1544] Failing suspected API request to not-running server: /jupyterhub/hub/user/user-15/api/kernels
[W 2024-05-11 08:22:53.418 JupyterHub log:192] 424 GET /jupyterhub/hub/user/user-15/api/kernels?1715415773273 ([email protected]) 9.57ms
08:22:57.502 [ConfigProxy] info: 200 GET /api/routes
[W 2024-05-11 08:23:02.818 JupyterHub user:936] user-8's server never showed up at http://10.32.0.23:8888/jupyterhub/user/user-8/ after 30 seconds. Giving up.
Common causes of this timeout, and debugging tips:
1. The server didn't finish starting,
or it crashed due to a configuration issue.
Check the single-user server's logs for hints at what needs fixing.
2. The server started, but is not accessible at the specified URL.
This may be a configuration issue specific to your chosen Spawner.
Check the single-user server logs and resource to make sure the URL
is correct and accessible from the Hub.
3. (unlikely) Everything is working, but the server took too long to respond.
To fix: increase `Spawner.http_timeout` configuration
to a number of seconds that is enough for servers to become responsive.
[I 2024-05-11 08:23:02.821 JupyterHub spawner:2849] Deleting pod jupyterhub-prod3/jupyter-user-2d8
[W 2024-05-11 08:23:08.259 JupyterHub base:1157] 3 consecutive spawns failed. Hub will exit if failure count reaches 5 before succeeding
[E 2024-05-11 08:23:08.259 JupyterHub gen:630] Exception in Future <Task finished name='Task-533' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.11/site-packages/jupyterhub/handlers/base.py:1081> exception=TimeoutError("Server at http://10.32.0.23:8888/jupyterhub/user/user-8/ didn't respond in 30 seconds")> after timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/tornado/gen.py", line 625, in error_callback
future.result()
File "/usr/local/lib/python3.11/site-packages/jupyterhub/handlers/base.py", line 1088, in finish_user_spawn
await spawn_future
File "/usr/local/lib/python3.11/site-packages/jupyterhub/user.py", line 917, in spawn
await self._wait_up(spawner)
File "/usr/local/lib/python3.11/site-packages/jupyterhub/user.py", line 961, in _wait_up
raise e
File "/usr/local/lib/python3.11/site-packages/jupyterhub/user.py", line 931, in _wait_up
resp = await server.wait_up(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/jupyterhub/utils.py", line 316, in wait_for_http_server
re = await exponential_backoff(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/jupyterhub/utils.py", line 257, in exponential_backoff
raise asyncio.TimeoutError(fail_message)
TimeoutError: Server at http://10.32.0.23:8888/jupyterhub/user/user-8/ didn't respond in 30 seconds
[I 2024-05-11 08:23:08.262 JupyterHub log:192] 200 GET /jupyterhub/hub/api/users/user-8/server/progress ([email protected]) 37868.35ms
> Dataplane V2 enabled

This could be what it's about, I think. This is their name for using Cilium as a CNI in the k8s cluster, and when Cilium enforces the chart's NetworkPolicy rules, there can be issues because of a documented limitation in Cilium.
Can you try disabling creation of the NetworkPolicy resources, or alternatively deleting those created by the chart, and see if that helps to start with?
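Something like this sketch (untested, from memory of the chart's values) should turn off the chart's NetworkPolicy resources:

```yaml
# Disable the NetworkPolicy resources created by the chart, to rule out
# Cilium / Dataplane V2 policy enforcement as the cause.
hub:
  networkPolicy:
    enabled: false
proxy:
  chp:
    networkPolicy:
      enabled: false
  traefik:
    networkPolicy:
      enabled: false
singleuser:
  networkPolicy:
    enabled: false
```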
There is an issue or two about this in the repo already; also search for cilium / dataplane v2 and see if you find it. I'm on mobile and can't quickly provide a link =\
For a resolution, it's good to know how Cloud DNS is accessed from within the cluster: is it a pod in the cluster, a private IP (10.x.x.x, 172.x.x.x, 192.168.x.x), or some unusual port?
I think you're talking about https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/3167, right? I read through that conversation, and it seems like you applied all of the changes needed to get it working when closing out that issue (thanks!), so I expect this to be a different issue.
I forgot to mention: while debugging this issue I did turn off all of the network policies for hub, singleuser, and proxy in config.yaml for one test, and there was no change. The only change that made a difference was switching between Cloud DNS and kube-dns.
I'll see if I have some more time to dig into this, but it'll be in a few days, since I now have a Standard and an Autopilot cluster (both with gvisor) working. I'm not too familiar with Kubernetes networking, so I'd have to spend some time figuring out how to debug this and gather those logs. It looks like Cloud DNS does create a pod in the cluster for DNS.
It might be OK to just say that a Standard cluster with Dataplane V2, Cloud DNS, and a user pool with gvisor enabled (but not applied at the pod level) doesn't work right now, since it's a pretty niche use case. Hopefully this doesn't cause any other issues down the road.
I think the reason it does not work with gvisor and Cloud DNS is this limitation listed on the GCP help page: https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#limitations-metadata