Cannot connect to Dask Gateway scheduler - cluster.get_client() results in TimeoutError

Open JColl88 opened this issue 2 years ago • 16 comments

Context

We have deployed Dask Gateway (0.9.0) via Helm, exposing the Traefik proxy via an externally-facing NGINX proxy. External traffic is SSL-encrypted (https), and behind the proxy all traffic is http. The prefix value is /dask-gateway, so access to Dask Gateway is via the URL https://[domain]/dask-gateway, where [domain] is the domain name configured on the proxy machine.

A section of the NGINX location (which may be relevant to the issue) is:

location /dask-gateway/ {
    proxy_pass http://<internal_ip>:80/dask-gateway/;
    proxy_redirect off;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;
    proxy_cache_bypass $http_upgrade;
}

What happened:

After instantiating a new GatewayCluster, I cannot call the get_client() method on it; doing so yields a TimeoutError. The error reads:

OSError: Timed out trying to connect to gateway://<domain>:443/my-dask-gateway.42e(...)2eb after 10 s

As the prefix value for the Dask Gateway deployment is /dask-gateway, it looks like this could be related to that prefix not being added to the URL.

What you expected to happen:

To receive a handle to a Client object.

Minimal Complete Verifiable Example:

from dask_gateway import Gateway, GatewayCluster
cluster = GatewayCluster('https://[domain]/dask-gateway/', auth="jupyterhub")
gateway = Gateway('https://[domain]/dask-gateway/', auth="jupyterhub")
gateway.list_clusters()
# [ClusterReport<name=my-dask-gateway.42e(...)2eb, status=RUNNING>]
cluster.scale(2)
client = cluster.get_client()
# OSError: Timed out trying to connect to gateway://<domain>:443/my-dask-gateway.42e(...)2eb after 10 s

Anything else we need to know?:

I've seen elsewhere that this could be related to the versions of dask or distributed being out of sync, but I am unsure exactly which versions are running on the Dask Gateway deployment (I've just deployed the latest version of the Helm chart, 0.9.0) or how to check.
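(One way I imagine checking the server-side versions, a sketch only with placeholder pod names, is to exec into the scheduler pod of a running cluster:)

$ kubectl -n my-dask-gateway get pods
$ kubectl -n my-dask-gateway exec <scheduler-pod-name> -- \
    python -c "import dask, distributed; print(dask.__version__, distributed.__version__)"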

The client environment is a Binder-generated JupyterHub environment built from https://github.com/dask/dask-examples, which by default does not include the dask-gateway Python package.

The Dask Gateway is configured as a service of the test JupyterHub deployment for the purposes of authentication.

Environment:

  • Dask version: 2.20.0
  • Python version: 3.8.12
  • Operating System: Ubuntu 18.04 (Jupyter base notebook container)
  • Install method (conda, pip, source): pip

JColl88 avatar Mar 02 '22 18:03 JColl88

Update: I have subsequently tried changing the proxy location to / (i.e. no URL suffix) and updating the dask-gateway deployment and Ingress path accordingly. Even in this case I get TimeoutErrors when trying to get the cluster client, so I don't think the problem is related to the /dask-gateway prefix in the URL. I also tried deploying an instance of DaskHub and accessing the JupyterHub's proxy-public via an SSH tunnel. This should be able to reach the Dask Gateway directly (i.e. not through any external proxy, only Dask Gateway's Traefik service), but once again, when calling get_client, I see the error:

OSError: Timed out during handshake while connecting to gateway://traefik-dhub-dask-gateway.my-daskhub:80/my-daskhub.df2(...)e30 after 10 s

(where the URL looks to be correct i.e. <schema>://<service>.<namespace>:<port>)

This increases my suspicion that this is a bug, and should be reproducible.

I'd be interested if anyone knows a workaround using the dask.distributed.Client constructor directly; I've tried a few things, but if I try, for instance:

from distributed import Client
client = Client("gateway://<domain>/dask-gateway/my-dask-gateway.42e(...)2eb")
# TypeError: Gateway expects a `ssl_context` argument of type ssl.SSLContext, instead got None
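(A sketch of an alternative I'm aware of, rather than constructing the Client by hand, is to re-attach to the existing cluster through the Gateway object, which carries the TLS credentials the gateway:// comm needs. The cluster name below is a placeholder, and since this still dials the same proxy address it presumably hits the same timeout until the routing issue is fixed:)

from dask_gateway import Gateway

gateway = Gateway("https://[domain]/dask-gateway/", auth="jupyterhub")
# "<cluster_name>" is a placeholder for the full name reported by gateway.list_clusters()
cluster = gateway.connect("<cluster_name>")
client = cluster.get_client()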

JColl88 avatar Mar 03 '22 14:03 JColl88

Hi! Have you deployed dask-gateway with the latest Traefik CRDs from main? I ran into a similar situation, with my call to cluster.get_client() timing out. Updating to the most recent Traefik CRDs from main stopped the cluster.get_client() call from timing out.

Hopefully this helps!

rigzba21 avatar Mar 14 '22 19:03 rigzba21

Thanks for the reply @rigzba21 - it sounds like the issue I'm seeing is related to this, but I'm now getting the "Failed to watch ..." errors you describe in your linked issue, so I may have missed something.

What I've done

Pull the latest dask-gateway code and install via:

$ helm upgrade --install --namespace=my-dask-gateway my-dask-gateway dask-gateway/resources/helm/dask-gateway/ -f /<path>/<to>/values.yaml

Where values.yaml just has:

gateway:
  prefix: /dask-gateway/
  auth:
    type: jupyterhub
    jupyterhub:
      ...

This led to the api- and traefik- pods restarting (so something was updated), but now Traefik just returns 404s.

Inspecting the Traefik pod logs reveals a series of errors such as:

Failed to watch *v1alpha1.TLSStore:
Failed to watch *v1alpha1.ServersTransport:
Failed to watch *v1alpha1.IngressRouteUDP:
Failed to watch *v1alpha1.MiddlewareTCP:

so it doesn't appear to be running correctly. Is there anything I've missed, e.g. creating these resources manually?

PS Running K8s 1.20

JColl88 avatar Mar 16 '22 15:03 JColl88

You need to install the CRDs in https://github.com/dask/dask-gateway/tree/main/resources/helm/dask-gateway/crds. The references to "v1alpha1" suggest this didn't happen. Helm doesn't do this for you; you must delete the old entries and apply the new ones with kubectl directly.
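Roughly (a sketch only; the exact CRD names depend on the Traefik version, so list what is actually installed first):

$ kubectl get crd | grep traefik
$ kubectl delete crd \
    ingressroutes.traefik.containo.us \
    ingressroutetcps.traefik.containo.us \
    middlewares.traefik.containo.us \
    tlsoptions.traefik.containo.us
$ kubectl apply -f dask-gateway/resources/helm/dask-gateway/crds/traefik.yaml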

martindurant avatar Mar 16 '22 15:03 martindurant

@JColl88 so I manually added the missing entries in https://github.com/dask/dask-gateway/blob/7b53ed72fc346650191e8fd2e6976fc60918c16f/resources/helm/dask-gateway/templates/traefik/rbac.yaml#L38-L53, manually replaced the file https://github.com/dask/dask-gateway/blob/main/resources/helm/dask-gateway/crds/traefik.yaml in my helm chart, and then ran a helm upgrade...

I don't think these changes have been added yet to an official helm chart release.

Hope this helps!

rigzba21 avatar Mar 16 '22 15:03 rigzba21

I don't think these changes have been added yet to an official helm chart release

Indeed, we are not yet able to make a release, unfortunately

martindurant avatar Mar 16 '22 15:03 martindurant

Thanks both for your help! I've done as @martindurant suggested and re-created the Traefik CRDs, and this resolves the Failed to watch errors I was seeing in Traefik, but I still encounter the OSError: Timed out trying to connect to gateway: errors mentioned in the original post.

@rigzba21 Haven't tried what you're suggesting yet but can do - was that to address the timeout errors though or the missing CRDs?

Edit: sorry, I hadn't read your message properly. Since I've just pulled up to date with main, I think I should already have the latest changes @rigzba21 added manually.

JColl88 avatar Mar 16 '22 15:03 JColl88

was that to address the timeout errors though or the missing CRDs?

Because I was missing the up-to-date Traefik CRDs and RBAC resource entries, my call to cluster.get_client() was dropped at the traefik pod and never reached the api/scheduler.

Initially, I was getting timeout errors and could not create a client connection. After some troubleshooting and debugging, I realized that the traefik pod was giving the errors mentioned in https://github.com/dask/dask-gateway/issues/483. Manually replacing the CRDs traefik.yaml file, adding the missing entries in rbac.yaml, and re-deploying my helm chart with those changes fixed both the Traefik pod errors and the client timeout I was running into.
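For reference, the RBAC additions are roughly of this shape, matching the resources from the "Failed to watch" errors above (a sketch from memory; check rbac.yaml on main for the exact list):

- apiGroups:
    - traefik.containo.us
  resources:
    - ingressrouteudps
    - middlewaretcps
    - serverstransports
    - tlsstores
  verbs:
    - get
    - list
    - watch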

rigzba21 avatar Mar 16 '22 15:03 rigzba21

It may be worth mentioning issue: https://github.com/dask/dask-gateway/issues/246 since there is some discussion of timeout errors with the get_client method there.

Having tried the suggestion in that issue:

cluster = GatewayCluster.from_name(
    "my-dask-gateway.cae...f8d",
    address="https://[domain]/dask-gateway/",
    proxy_address='gateway://traefik-my-dask-gateway.my-dask-gateway:80',
    auth="jupyterhub"
)
client = cluster.get_client()

I'm still encountering the same timeout errors mentioned above.

JColl88 avatar Mar 16 '22 16:03 JColl88

@JColl88 I will try and see if I can reproduce this issue in a local minikube cluster to help further troubleshoot. For your NGINX config, are you running ingress-nginx and setting the config via ingress annotations, or are you running NGINX separate from k8s? I just want to make sure my local setup is similar.

rigzba21 avatar Mar 21 '22 13:03 rigzba21

Thanks @rigzba21 - much appreciated.

We have a separate proxy machine natively running NGINX that listens for https traffic and then routes it over http to the head node of the K8s cluster where Dask Gateway is running. As there are multiple services in that cluster listening on port 80 (separate JupyterHub and Dask Gateway deployments), the respective namespaces have NGINX-based Ingress resources. For the my-dask-gateway namespace, the spec looks like this (where <domain> stands in for the actual domain):

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: traefik-dask-gateway-ingress
  namespace: my-dask-gateway
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    nginx.ingress.kubernetes.io/proxy-body-size: 1000m
spec:
  rules:
  - host: <domain>
    http:
      paths:
      - path: /dask-gateway/
        backend:
          serviceName: traefik-my-dask-gateway
          servicePort: 80

The Ingress controller has been deployed using these definitions: https://github.com/JColl88/jupyterhub-on-k8s/blob/master/ingress/deployIngressController.yaml.

We have a number of services running in a similar way, so this shouldn't be problematic, but worth including for completeness. The fact that traffic can get through to the gateway API to start/scale/stop clusters, and that dashboard links work correctly, suggests that the proxy and Ingress are working as expected.

JColl88 avatar Mar 21 '22 16:03 JColl88

@JColl88 , do you also forward TCP traffic? I think that's probably necessary.

martindurant avatar Mar 21 '22 16:03 martindurant

Hi @martindurant - not explicitly; so this sounds like something I should explore further. I'm guessing if this isn't specifically done by NGINX I'll need to set this up for both the proxy NGINX service and the Ingress resource?

(If you know of helpful docs for this, feel free to point me in the right direction, otherwise I'll do some digging when I get a chance!)

JColl88 avatar Mar 21 '22 16:03 JColl88

Honestly, I don't know how it works through another layer of proxy, but dask-gateway's traefik layer, whether exposed as a direct ingress or as NodePorts on the cluster, certainly needs incoming TCP, as this is what the dask client speaks. The TCP traffic can share the same port as the HTTP traffic for the gateway API and the dashboards. Since it's TCP, routing cannot depend on a path, so you probably need a port separate from any other HTTP traffic you might have, and pass everything through (except maybe SSL termination).

martindurant avatar Mar 21 '22 16:03 martindurant

@JColl88 maybe NGINX TCP/UDP Load Balancing and (possibly) ssl_preread for SNI?

I'm not sure if this is what would help.

rigzba21 avatar Mar 21 '22 22:03 rigzba21

Thanks both for the help - this does sound plausible if it's only the gateway:// traffic that's getting stuck.

I'll look into it as soon as possible, but this was a short innovation project a couple of weeks ago so I may struggle to prioritise it imminently. If you're happy for me to leave the ticket open in the meantime I will update if/when possible, but equally if you'd rather clear it from the backlog, go ahead and mark as closed for now.

Hopefully setting Dask Gateway behind an additional NGINX proxy is a common enough approach for this issue to be of wider interest.

JColl88 avatar Mar 21 '22 22:03 JColl88

It took me a while to get back to this. As @martindurant and @rigzba21 suggested, the problem I was seeing was a result of the gateway:// TCP traffic getting stuck at our NGINX reverse proxy. Here are some notes to hopefully help those with a similar setup (if there are any!) work around the issue.

Optional Debugging to verify deployment:

(Feel free to skip to NGINX config section below.)

For debugging, it can be helpful to test by directing traffic from within the same K8s cluster, in much the same way as I mentioned above.

First, it was helpful to test from a DaskHub deployment's JupyterHub service. By default this will point to the DaskHub's own Dask Gateway instance, but by passing a URL to a separate Dask Gateway deployment (the one we're debugging), we can test this with a client that is known to work. I noticed this requires the proxy_address variable to be explicitly set (otherwise the gateway:// traffic is still sent to the daskhub namespace's Traefik proxy). E.g.

from dask_gateway import GatewayCluster

cluster = GatewayCluster(
    address='http://traefik-my-dask-gateway.my-dask-gateway:8083/dask-gateway/',
    proxy_address='http://traefik-my-dask-gateway.my-dask-gateway:8083/dask-gateway/',
    auth="jupyterhub"
)
cluster.get_client()

where the service name is traefik-my-dask-gateway, the namespace is my-dask-gateway, and 8083 is the internal port the service is listening on.
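(The service name and internal port can be confirmed with, e.g.:)

$ kubectl -n my-dask-gateway get svc traefik-my-dask-gateway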

Once this has been verified, it can be tested from a standalone JupyterHub instance, but I'd recommend starting with the same environment used by DaskHub (https://github.com/dask/helm-chart/blob/main/daskhub/values.yaml#L57). Version mismatch issues can seemingly also yield the OSError: Timed out trying to connect to gateway:// error, so testing from an environment known to work is helpful.
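A quick client-side check of the relevant package versions, for comparison against that environment:

$ python -c "import dask, distributed, dask_gateway; print(dask.__version__, distributed.__version__, dask_gateway.__version__)"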

NGINX config for forwarding gateway:// traffic

The stream directive must be used to forward traffic. See https://docs.nginx.com/nginx/admin-guide/load-balancer/tcp-udp-load-balancer/ as linked above.

An example block would look like:

stream {
    server {
        listen 80;
        proxy_pass <dask-gateway_host_ip>:<traefik_nodeport>;
    }
}

where <traefik_nodeport> is the Traefik service's NodePort, i.e. the high port number (typically 3XXXX). Note that there cannot be a protocol prefix (http://) ahead of <dask-gateway_host_ip>.
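(The NodePort itself can be read from the Traefik service; the jsonpath below assumes the HTTP entrypoint port is named "web", so adjust to match your chart:)

$ kubectl -n my-dask-gateway get svc traefik-my-dask-gateway \
    -o jsonpath='{.spec.ports[?(@.name=="web")].nodePort}'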

With this, the service can be queried from clients inside or outside the cluster e.g.:

from dask_gateway import Gateway, GatewayCluster
cluster = GatewayCluster('http://<proxy_floating_ip>:80/dask-gateway/', auth="jupyterhub")
gateway = Gateway('http://<proxy_floating_ip>:80/dask-gateway/', auth="jupyterhub")
gateway.list_clusters()
# [ClusterReport<name=my-dask-gateway.5d2...a4b, status=RUNNING>]
client = cluster.get_client()
print(client)
# <Client: 'tls://10.243.2.54:8786' processes=0 threads=0, memory=0 B>

JColl88 avatar Nov 10 '22 17:11 JColl88

@JColl88 wow such a great followup!!! Thank you soo much for writing this down for others, followups like these are soo valuable!

Do you think the title should be updated in some way to help others find their way to your insights when searching around?

consideRatio avatar Nov 11 '22 05:11 consideRatio

Good idea @consideRatio, I've modified the title though let me know if you can think of a way to make it even clearer. From my side I'm happy for the issue to be closed now.

JColl88 avatar Nov 11 '22 12:11 JColl88

Thank you soo much @JColl88 for reporting, investigating, and following this up so clearly!!! I've also learned from your experience now :)

All the best!

consideRatio avatar Nov 11 '22 12:11 consideRatio

Just adding one additional comment to this, since this issue can also emerge due to blocked TCP traffic intra-cluster, depending on the CNI used.

In our case, on an old cluster with Flannel to handle networking, there was no issue running vanilla DaskHub and connecting to the clusters, but once we moved to a newer infra with Calico as the CNI, I started seeing the TimeoutErrors again when trying to get client handles.

The cause is that Calico enforces network policies: https://docs.tigera.io/calico/latest/network-policy/get-started/calico-policy/calico-network-policy#ingress-and-egress

TL;DR: a simple solution in this case is to allow all traffic within the namespace (from the Jupyter notebook environments to the Dask Gateway Traefik proxy) by creating a new network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress-egress
spec:
  podSelector: {}
  egress:
  - {}
  ingress:
  - {}
  policyTypes:
  - Egress
  - Ingress
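
This can then be applied in the namespace where the notebook pods and the Dask Gateway Traefik proxy run, e.g. (the filename is arbitrary):

$ kubectl -n <namespace> apply -f allow-all-ingress-egress.yaml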

Also related: https://github.com/dask/dask-gateway/issues/360

JColl88 avatar Jul 25 '23 11:07 JColl88