zero-to-jupyterhub-k8s
GKE Deployment with gvisor fails with Cloud DNS but not with kube-dns
Bug description
I'm deploying JupyterHub on GKE. I'd like a separation between the default pool and a user pool so that I can use gvisor for container isolation. Currently, I have a Standard cluster running fine with kube-dns, Dataplane V2 turned off, a default pool, and a user pool with gvisor container isolation turned on, using `scheduling.userPods.nodeAffinity.matchNodePurpose: require` to route user servers to the user pool. This cluster runs without any issues.
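For reference, this is roughly the values snippet that handles the routing and the sandbox runtime on that working cluster (a minimal sketch; the full configuration I use is included under "How to reproduce" below):

```yaml
# Minimal sketch of the routing-related values on the working (kube-dns) cluster.
# Assumes the node pools carry the hub.jupyter.org/node-purpose labels described below.
singleuser:
  extraPodConfig:
    runtimeClassName: gvisor      # run user servers under the GKE Sandbox (gvisor) runtime
scheduling:
  userPods:
    nodeAffinity:
      matchNodePurpose: require   # only schedule user pods on nodes labeled node-purpose=user
```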
I've heard that GKE is moving towards Autopilot clusters (which have Dataplane V2 and Cloud DNS enabled), so I decided to create a new Standard cluster with Dataplane V2 and Cloud DNS turned on. This cluster was configured the same way: a default pool and a user pool created with gvisor container isolation. However, in this setup the hub pod is unable to communicate with the user server pod (see errors below).
To isolate the issue, I created a few more clusters with the exact same config (a default pool and a user pool), all with Dataplane V2 on:
- User pool created with gvisor isolation, Cloud DNS, `singleuser.extraPodConfig.runtimeClassName: gvisor`: this is the failing cluster setup described above.
- User pool created with gvisor isolation, Cloud DNS, `singleuser.extraPodConfig.runtimeClassName` not set: this also fails. Since it's the `gvisor` runtime class name that actually turns gvisor on for a pod, the issue occurs even when gvisor is not turned on.
- User pool created with gvisor isolation, kube-dns instead of Cloud DNS: works properly; no issues.
- User pool created without gvisor isolation, Cloud DNS: works properly; no issues.
As a result, it looks like there's an issue with JupyterHub deployed on GKE with the following configuration:
- Standard GKE cluster
- Dataplane V2 enabled
- Cloud DNS enabled instead of kube-dns
- A default pool with the `hub.jupyter.org/node-purpose: core` label. This is to route core and user pods to different nodes.
- A user pool created with gvisor container isolation and the `hub.jupyter.org/node-purpose: user` label.
- Note that gvisor isolation is only applied to a pod when `gvisor` is added as the pod's runtime class name. However, the pod's `runtimeClassName` does not need to be set to `gvisor` for this issue to occur. (A rough sketch of the resulting user pod spec is shown right after this list.)
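To make the scheduling mechanics concrete, here is a rough sketch of the scheduling-related fields that end up on a spawned user pod in this setup. This is not an exact dump of one of my pods: `jupyter-user-example` is a placeholder name, and the affinity is shown for the `require` variant of `matchNodePurpose` (with `prefer` it becomes a preferred rule instead).

```yaml
# Rough sketch of the scheduling-related fields on a spawned user server pod.
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-user-example            # placeholder name
spec:
  # runtimeClassName: gvisor            # only present when singleuser.extraPodConfig.runtimeClassName is set
  affinity:
    nodeAffinity:                       # from scheduling.userPods.nodeAffinity.matchNodePurpose: require
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: hub.jupyter.org/node-purpose
                operator: In
                values: ["user"]
  tolerations:                          # from scheduling.userPods.tolerations in the config below
    - key: hub.jupyter.org/dedicated
      operator: Equal
      value: user
      effect: NoSchedule
    - key: sandbox.gke.io/runtime       # tolerate the taint GKE puts on gvisor node pools
      operator: Equal
      value: gvisor
      effect: NoSchedule
```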
I've also looked at https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/3167, which previously addressed issues with Dataplane V2 and Cloud DNS, but it seems like those changes have already made it into the latest release.
Note that I'm aware gvisor is not officially supported by Z2JH, and it's entirely possible this is an undocumented limitation between gvisor container isolation and Cloud DNS! However, since this issue occurs even when gvisor is turned off, and there doesn't seem to be a wider problem using gvisor with Cloud DNS, there might be something JupyterHub-specific around DNS here. Given that GKE is pushing people towards Autopilot/Cloud DNS clusters, which have these options pre-selected by default (the only change needed to use gvisor in an Autopilot cluster is to add the gvisor runtime class name), I wonder if this issue also prevents people from deploying JupyterHub on an Autopilot cluster today.
(edit: it looks like deploying an Autopilot cluster does not work, whether or not I use `runtimeClassName: gvisor`: the hub and server both time out while waiting for the server to start. That is, unless there's something else that needs to be done beyond what's already specified on Discourse.)
(edit 2: I was mistaken; I'm able to start an Autopilot cluster. Launching the user server just takes a really long time, since user placeholders don't take on the user servers' tolerations/selectors, so we need to wait for a new node to be created and an image to be pulled.)
How to reproduce
- Create a new Standard cluster in GKE. When creating the cluster, select Dataplane V2 and Cloud DNS (instead of kube-dns).
- Edit the default node pool to add the Kubernetes label `hub.jupyter.org/node-purpose: core` to its nodes. This is to make sure core and user pods are scheduled on separate nodes.
- Once the cluster is created, add a node pool that has gvisor sandbox isolation turned on. Also add `hub.jupyter.org/node-purpose: user` as a Kubernetes label on that pool's nodes.
- Deploy JupyterHub to the cluster according to the standard instructions here, using a Helm chart 3.3.7 configuration file with the following settings:
hub:
  baseUrl: "/jupyterhub"
singleuser:
  # extraPodConfig:
  #   # It doesn't matter whether or not this is enabled for reproducing the issue. But in a properly
  #   # working cluster, this needs to be set in order to actually use gvisor isolation.
  #   runtimeClassName: gvisor
  # GKE Sandbox (gvisor) blocks these pods from accessing the Cloud Metadata Server,
  # so we don't need to block it via iptables.
  # https://z2jh.jupyter.org/en/latest/administrator/security.html#audit-cloud-metadata-server-access
  cloudMetadata:
    blockWithIptables: false
scheduling:
  userPods:
    # The gvisor user pool comes by default with the `sandbox.gke.io/runtime: gvisor` taint, so we need to
    # redeclare the default tolerations and add that one.
    # GKE might automatically set this toleration, but we explicitly set it anyway.
    # https://cloud.google.com/kubernetes-engine/docs/how-to/sandbox-pods#regular-pod
    tolerations:
      - effect: NoSchedule
        key: hub.jupyter.org/dedicated
        operator: Equal
        value: user
      - effect: NoSchedule
        key: hub.jupyter.org_dedicated
        operator: Equal
        value: user
      - effect: NoSchedule
        key: sandbox.gke.io/runtime
        operator: Equal
        value: gvisor
    nodeAffinity:
      matchNodePurpose: prefer # or require
cull:
  enabled: true
- Try to start a user server. It should fail with the error messages listed below.
Expected behaviour
The user server should initialize and register itself with the hub.
Working Logs
These are the logs I expect to see if communication between the hub and user server is working:
[I 2024-05-13 22:47:58.993 JupyterHub spawner:2511] Attempting to create pod jupyter-user-2d8, with timeout 3
I0513 22:47:59.017421 1 eventhandlers.go:118] "Add event for unscheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.021208 1 scheduling_queue.go:1070] "About to try and schedule pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.022545 1 schedule_one.go:81] "Attempting to schedule pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.024308 1 default_binder.go:52] "Attempting to bind pod to node" pod="jupyterhub-prod3/jupyter-user-2d8" node="gke-jupyterhub-prod3-user-pool-cloud--6d92de73-bbr4"
I0513 22:47:59.019448 1 eventhandlers.go:118] "Add event for unscheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.036024 1 schedule_one.go:252] "Successfully bound pod to node" pod="jupyterhub-prod3/jupyter-user-2d8" node="gke-jupyterhub-prod3-user-pool-cloud--6d92de73-bbr4" evaluatedNodes=2 feasibleNodes=1
I0513 22:47:59.036857 1 eventhandlers.go:186] "Add event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.036907 1 eventhandlers.go:161] "Delete event for unscheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.041285 1 eventhandlers.go:186] "Add event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.041284 1 eventhandlers.go:161] "Delete event for unscheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.074146 1 eventhandlers.go:206] "Update event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:47:59.074142 1 eventhandlers.go:206] "Update event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
[I 2024-05-13 22:47:59.110 JupyterHub log:192] 200 GET /jupyterhub/hub/api/users/user-8 ([email protected]) 7.50ms
I0513 22:47:59.837566 1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.StorageClass total 7 items received
I0513 22:47:59.909642 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="104.548µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.0.1:51028" resp=200
I0513 22:47:59.910888 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="74.645µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.0.1:51048" resp=200
I0513 22:47:59.917849 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="109.625µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.0.1:51024" resp=200
I0513 22:47:59.919774 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="61.369µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.0.1:51020" resp=200
I0513 22:48:00.295593 1 eventhandlers.go:206] "Update event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:48:00.297583 1 eventhandlers.go:206] "Update event for scheduled pod" pod="jupyterhub-prod3/jupyter-user-2d8"
I0513 22:48:01.085834 1 leaderelection.go:352] lock is held by user-scheduler-ffbb9c68f-jtf5q_dd1ddc82-dfa5-4b14-b457-9c0a922acd16 and has not yet expired
I0513 22:48:01.085857 1 leaderelection.go:253] failed to acquire lease jupyterhub-prod3/user-scheduler-lock
I0513 22:48:01.230979 1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.ReplicationController total 10 items received
[I 2024-05-13 22:48:01.773 JupyterHub log:192] 200 GET /jupyterhub/hub/api (@10.32.2.17) 0.90ms
[I 2024-05-13 22:48:02.004 JupyterHub log:192] 200 POST /jupyterhub/hub/api/users/user-8/activity ([email protected]) 14.48ms
[I 2024-05-13 22:48:03.363 ServerApp] 200 GET /jupyterhub/user/manager-1/api/kernels?1715640483157 ([email protected]) 0.99ms
I0513 22:48:03.839446 1 reflector.go:559] vendor/k8s.io/client-go/informers/factory.go:150: Watch close - *v1.PersistentVolumeClaim total 10 items received
I0513 22:48:04.319039 1 leaderelection.go:352] lock is held by user-scheduler-ffbb9c68f-jtf5q_dd1ddc82-dfa5-4b14-b457-9c0a922acd16 and has not yet expired
I0513 22:48:04.319062 1 leaderelection.go:253] failed to acquire lease jupyterhub-prod3/user-scheduler-lock
[I 2024-05-13 22:48:04.460 ServerApp] 200 GET /jupyterhub/user/manager-1/api/kernels?1715640484260 ([email protected]) 1.16ms
[W 2024-05-13 22:48:04.644 JupyterHub _version:38] Single-user server has no version header, which means it is likely < 0.8. Expected 4.1.5
[I 2024-05-13 22:48:04.644 JupyterHub base:1090] User user-8 took 5.686 seconds to start
Actual behaviour
The user server pod is unable to communicate with the hub pod. After a while, the hub times out and the user server shuts down.
Erroring Logs
You can see here that a spawn of the user server is requested, but API requests to the not-yet-running server fail with 424. Once the JupyterHubSingleUser server starts, it fails to connect back to the hub as well (name resolution of `hub` fails), and the hub eventually gives up waiting for the server.
[I 2024-05-13 19:00:05.847 JupyterHub base:1616] User test-service requesting spawn on behalf of manager-1
[W 2024-05-13 19:00:05.848 JupyterHub base:1544] Failing suspected API request to not-running server: /jupyterhub/hub/user/manager-1/api/kernels
[W 2024-05-13 19:00:05.848 JupyterHub log:192] 424 GET /jupyterhub/hub/user/manager-1/api/kernels?1715626805678 ([email protected]) 8.15ms
I0513 19:00:06.507509 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="113.911µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.2.1:54148" resp=200
I0513 19:00:06.511715 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="104.277µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.2.1:54136" resp=200
I0513 19:00:06.660512 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="139.622µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.2.1:36614" resp=200
I0513 19:00:06.663019 1 httplog.go:132] "HTTP" verb="GET" URI="/healthz" latency="111.271µs" userAgent="kube-probe/1.28" audit-ID="" srcIP="10.32.2.1:36616" resp=200
[I 2024-05-13 19:00:06.667 JupyterHub log:192] 302 GET /jupyterhub/user/user-8/api/kernels?1715626806577 -> /jupyterhub/hub/user/user-8/api/kernels?1715626806577 (@10.138.0.44) 0.88ms
[I 2024-05-13 19:00:06.755 JupyterHub base:1616] User test-service requesting spawn on behalf of user-8
[W 2024-05-13 19:00:06.756 JupyterHub base:1544] Failing suspected API request to not-running server: /jupyterhub/hub/user/user-8/api/kernels
[W 2024-05-13 19:00:06.757 JupyterHub log:192] 424 GET /jupyterhub/hub/user/user-8/api/kernels?1715626806577 ([email protected]) 9.40ms
[E 2024-05-13 18:53:44.773 JupyterHubSingleUser] Failed to connect to my Hub at http://hub:8081/jupyterhub/hub/api (attempt 1/5). Is it running?
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/jupyterhub/singleuser/extension.py", line 336, in check_hub_version
resp = await client.fetch(self.hub_auth.api_url)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tornado.simple_httpclient.HTTPTimeoutError: Timeout while connecting
[E 2024-05-13 18:53:44.790 ServerApp] Exception in callback functools.partial(<function _HTTPConnection.__init__.<locals>.<lambda> at 0x7cb061b69ee0>, <Task finished name='Task-3' coro=<_HTTPConnection.run() done, defined at /opt/conda/lib/python3.11/site-packages/tornado/simple_httpclient.py:290> exception=gaierror(-3, 'Temporary failure in name resolution')>
)
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/site-packages/tornado/ioloop.py", line 738, in _run_callback
ret = callback()
^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/tornado/simple_httpclient.py", line 287, in <lambda>
gen.convert_yielded(self.run()), lambda f: f.result()
^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/tornado/simple_httpclient.py", line 340, in run
stream = await self.tcp_client.connect(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/tornado/tcpclient.py", line 269, in connect
addrinfo = await self.resolver.resolve(host, port, af)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/tornado/netutil.py", line 433, in resolve
for fam, _, _, _, address in await asyncio.get_running_loop().getaddrinfo(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 867, in getaddrinfo
return await self.run_in_executor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/socket.py", line 962, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution
[I 2024-05-11 08:10:36.220 JupyterHub base:1616] User test-service requesting spawn on behalf of user-15
[W 2024-05-11 08:10:36.220 JupyterHub base:1544] Failing suspected API request to not-running server: /jupyterhub/hub/user/user-15/api/kernels
[W 2024-05-11 08:22:53.418 JupyterHub log:192] 424 GET /jupyterhub/hub/user/user-15/api/kernels?1715415773273 ([email protected]) 9.57ms
08:22:57.502 [ConfigProxy] info: 200 GET /api/routes
[W 2024-05-11 08:23:02.818 JupyterHub user:936] user-8's server never showed up at http://10.32.0.23:8888/jupyterhub/user/user-8/ after 30 seconds. Giving up.
Common causes of this timeout, and debugging tips:
1. The server didn't finish starting,
or it crashed due to a configuration issue.
Check the single-user server's logs for hints at what needs fixing.
2. The server started, but is not accessible at the specified URL.
This may be a configuration issue specific to your chosen Spawner.
Check the single-user server logs and resource to make sure the URL
is correct and accessible from the Hub.
3. (unlikely) Everything is working, but the server took too long to respond.
To fix: increase `Spawner.http_timeout` configuration
to a number of seconds that is enough for servers to become responsive.
[I 2024-05-11 08:23:02.821 JupyterHub spawner:2849] Deleting pod jupyterhub-prod3/jupyter-user-2d8
[W 2024-05-11 08:23:08.259 JupyterHub base:1157] 3 consecutive spawns failed. Hub will exit if failure count reaches 5 before succeeding
[E 2024-05-11 08:23:08.259 JupyterHub gen:630] Exception in Future <Task finished name='Task-533' coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.11/site-packages/jupyterhub/handlers/base.py:1081> exception=TimeoutError("Server at http://10.32.0.23:8888/jupyterhub/user/user-8/ didn't respond in 30 seconds")> after timeout
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/tornado/gen.py", line 625, in error_callback
future.result()
File "/usr/local/lib/python3.11/site-packages/jupyterhub/handlers/base.py", line 1088, in finish_user_spawn
await spawn_future
File "/usr/local/lib/python3.11/site-packages/jupyterhub/user.py", line 917, in spawn
await self._wait_up(spawner)
File "/usr/local/lib/python3.11/site-packages/jupyterhub/user.py", line 961, in _wait_up
raise e
File "/usr/local/lib/python3.11/site-packages/jupyterhub/user.py", line 931, in _wait_up
resp = await server.wait_up(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/jupyterhub/utils.py", line 316, in wait_for_http_server
re = await exponential_backoff(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/jupyterhub/utils.py", line 257, in exponential_backoff
raise asyncio.TimeoutError(fail_message)
TimeoutError: Server at http://10.32.0.23:8888/jupyterhub/user/user-8/ didn't respond in 30 seconds
[I 2024-05-11 08:23:08.262 JupyterHub log:192] 200 GET /jupyterhub/hub/api/users/user-8/server/progress ([email protected]) 37868.35ms
> Dataplane V2 enabled

This could be what it's about, I think. This is their name for using Cilium as a CNI in the k8s cluster, and when Cilium enforces the chart's NetworkPolicy rules, there can be issues because of a documented limitation in Cilium.
Can you try disabling creation of the NetworkPolicy resources, or alternatively deleting those created by the chart, and see if that helps to start with?
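Something like this sketch (untested, from memory of the chart's values) should turn off the chart's NetworkPolicy resources:

```yaml
# Disable the NetworkPolicy resources created by the chart, to rule out
# Cilium / Dataplane V2 policy enforcement as the cause.
hub:
  networkPolicy:
    enabled: false
proxy:
  chp:
    networkPolicy:
      enabled: false
  traefik:
    networkPolicy:
      enabled: false
singleuser:
  networkPolicy:
    enabled: false
```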
There is an issue or two about this in the repo already; also search for cilium / dataplane v2 and see if you find it. I'm on mobile and can't quickly provide a link =\
For a resolution, it's good to know how Cloud DNS is accessed from within the cluster: is it a pod in the cluster, a private IP (10.x.x.x, 172.x.x.x, 192.168.x.x), or some unusual port?
I think you're talking about https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/3167, right? I read through that conversation, and it seems like you applied all of the changes needed to get it working when closing out that issue (thanks!), so I expect this to be a different issue.
I forgot to mention: while debugging this issue I did turn off all of the network policies for hub, singleuser, and proxy in config.yaml for one test, and there was no change. The only change that made a difference was switching between Cloud DNS and kube-dns.
I'll see if I have some more time to dig into this, but it'll be in a few days, since I now have a Standard and an Autopilot cluster (both with gvisor) working. I'm not too familiar with Kubernetes networking, so I'd have to spend some time figuring out how to debug this and gather those logs. It looks like Cloud DNS does create a pod in the cluster for DNS.
It might be OK to just say that a Standard cluster with Dataplane V2, Cloud DNS, and a user pool with gvisor enabled (but not applied at the pod level) doesn't work right now, since it's a pretty niche use case. Hopefully this doesn't cause any other issues down the road.
I think the reason it does not work with gvisor and Cloud DNS is this limitation listed on the GCP help page: https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#limitations-metadata