dask-gateway
Dask helm deployment not working in AKS
What happened:
I am trying to deploy Dask Gateway to Azure following the documentation: https://gateway.dask.org/install-kube.html
We already have an AKS cluster that is configured to use a Traefik ingress. To avoid a duplicate Traefik deployment, I downloaded the latest version of the chart and created a modified version by removing the contents of the template/traefik folder. Everything else is the same as the official Helm chart.
I deployed Dask Gateway successfully and the pods are running without crashing. I then tried to access the deployed Dask Gateway instance from a Jupyter notebook, also deployed within the same cluster. Since I only need to access it within the cluster, I tried directly accessing the ClusterIP service: api-
Could you please help in resolving this issue?
What you expected to happen:
Minimal Complete Verifiable Example:
# Put your MCVE code here
Anything else we need to know?:
Environment:
- Dask version: dask helm chart 0.9.0
- Python version:
- Operating System:
- Install method (conda, pip, source): helm
Could you share your Dask Gateway config? Particularly your auth config.
Hi @jacobtomlinson ,
Thanks for looking into this. Please find the values.yaml and the helm debug output below.
```yaml
gateway:
  # Number of instances of the gateway-server to run
  replicas: 1
  # Annotations to apply to the gateway-server pods.
  annotations: {}
  # Resource requests/limits for the gateway-server pod.
  resources: {}
  # Path prefix to serve dask-gateway api requests under
  # This prefix will be added to all routes the gateway manages
  # in the traefik proxy.
  prefix: /
  # The gateway server log level
  loglevel: INFO
  # The image to use for the gateway-server pod.
  image:
    name: <azure_container_registry>/daskgateway/dask-gateway-server
    tag: 0.9.0
    pullPolicy: IfNotPresent
  # Image pull secrets for gateway-server pod
  imagePullSecrets: []
  # Configuration for the gateway-server service
  service:
    annotations: {}
  auth:
    # The auth type to use. One of {simple, kerberos, jupyterhub, custom}.
    type: simple
    simple:
      # A shared password to use for all users.
      password: null
    kerberos:
      # Path to the HTTP keytab for this node.
      keytab: null
    jupyterhub:
      # A JupyterHub api token for dask-gateway to use. See
      # https://gateway.dask.org/install-kube.html#authenticating-with-jupyterhub.
      apiToken: null
      # JupyterHub's api url. Inferred from JupyterHub's service name if running
      # in the same namespace.
      apiUrl: null
    custom:
      # The full authenticator class name.
      class: null
      # Configuration fields to set on the authenticator class.
      options: {}
  livenessProbe:
    # Enables the livenessProbe.
    enabled: true
    # Configures the livenessProbe.
    initialDelaySeconds: 5
    timeoutSeconds: 2
    periodSeconds: 10
    failureThreshold: 6
  readinessProbe:
    # Enables the readinessProbe.
    enabled: true
    # Configures the readinessProbe.
    initialDelaySeconds: 5
    timeoutSeconds: 2
    periodSeconds: 10
    failureThreshold: 3
  backend:
    # The image to use for both schedulers and workers.
    image:
      name: <azure_container_registry>/daskgateway/dask-gateway
      tag: 0.9.0
      pullPolicy: IfNotPresent
    # The namespace to launch dask clusters in. If not specified, defaults to
    # the same namespace the gateway is running in.
    namespace: null
    # A mapping of environment variables to set for both schedulers and workers.
    environment: null
    scheduler:
      # Any extra configuration for the scheduler pod. Sets
      # `c.KubeClusterConfig.scheduler_extra_pod_config`.
      extraPodConfig: {}
      # Any extra configuration for the scheduler container.
      # Sets `c.KubeClusterConfig.scheduler_extra_container_config`.
      extraContainerConfig: {}
      # Cores request/limit for the scheduler.
      cores:
        request: null
        limit: null
      # Memory request/limit for the scheduler.
      memory:
        request: null
        limit: null
    worker:
      # Any extra configuration for the worker pod. Sets
      # `c.KubeClusterConfig.worker_extra_pod_config`.
      extraPodConfig: {}
      # Any extra configuration for the worker container. Sets
      # `c.KubeClusterConfig.worker_extra_container_config`.
      extraContainerConfig: {}
      # Cores request/limit for each worker.
      cores:
        request: null
        limit: null
      # Memory request/limit for each worker.
      memory:
        request: null
        limit: null
  # Settings for nodeSelector, affinity, and tolerations for the gateway pods
  nodeSelector: {}
  affinity: {}
  tolerations: []
  # Any extra configuration code to append to the generated `dask_gateway_config.py`
  # file. Can be either a single code-block, or a map of key -> code-block
  # (code-blocks are run in alphabetical order by key, the key value itself is
  # meaningless). The map version is useful as it supports merging multiple
  # `values.yaml` files, but is unnecessary in other cases.
  extraConfig: {}

# Configuration for the gateway controller
controller:
  # Whether the controller should be deployed. Disabling the controller allows
  # running it locally for development/debugging purposes.
  enabled: true
  # Any annotations to add to the controller pod
  annotations: {}
  # Resource requests/limits for the controller pod
  resources: {}
  # Image pull secrets for controller pod
  imagePullSecrets: []
  # The controller log level
  loglevel: INFO
  # Max time (in seconds) to keep around records of completed clusters.
  # Default is 24 hours.
  completedClusterMaxAge: 86400
  # Time (in seconds) between cleanup tasks removing records of completed
  # clusters. Default is 5 minutes.
  completedClusterCleanupPeriod: 600
  # Base delay (in seconds) for backoff when retrying after failures.
  backoffBaseDelay: 0.1
  # Max delay (in seconds) for backoff when retrying after failures.
  backoffMaxDelay: 300
  # Limit on the average number of k8s api calls per second.
  k8sApiRateLimit: 50
  # Limit on the maximum number of k8s api calls per second.
  k8sApiRateLimitBurst: 100
  # The image to use for the controller pod.
  image:
    name: <azure_container_registry>/daskgateway/dask-gateway-server
    tag: 0.9.0
    pullPolicy: IfNotPresent
  # Settings for nodeSelector, affinity, and tolerations for the controller pods
  nodeSelector: {}
  affinity: {}
  tolerations: []

# Configuration for the traefik proxy
traefik:
  # Number of instances of the proxy to run
  replicas: 1
  # Any annotations to add to the proxy pods
  annotations: {}
  # Resource requests/limits for the proxy pods
  resources: {}
  # The image to use for the proxy pod
  image:
    name: traefik
    tag: 2.1.3
  # Any additional arguments to forward to traefik
  additionalArguments: []
  # The proxy log level
  loglevel: WARN
  # Whether to expose the dashboard on port 9000 (enable for debugging only!)
  dashboard: false
  # Additional configuration for the traefik service
  service:
    type: LoadBalancer
    annotations: {}
    spec: {}
    ports:
      web:
        # The port HTTP(s) requests will be served on
        port: 80
        nodePort: null
      tcp:
        # The port TCP requests will be served on. Set to `web` to share the
        # web service port
        port: web
        nodePort: null
  # Settings for nodeSelector, affinity, and tolerations for the traefik pods
  nodeSelector: {}
  affinity: {}
  tolerations: []

rbac:
  # Whether to enable RBAC.
  enabled: true
  # Existing names to use if ClusterRoles, ClusterRoleBindings, and
  # ServiceAccounts have already been created by other means (leave set to
  # `null` to create all required roles at install time)
  controller:
    serviceAccountName: null
  gateway:
    serviceAccountName: null
  traefik:
    serviceAccountName: null
```
Best Regards, Aravind
You're getting a 403 error. How are you authenticating with Dask Gateway?
I was trying to test this component in AKS by following this document: https://gateway.dask.org/install-kube.html. I haven't configured any authenticator, so by default I believe the simple authenticator would be used. I was trying to connect to the dask-gateway using the code below from a Jupyter notebook instance deployed in the same cluster, in another namespace:

```python
from dask_gateway import Gateway
gateway = Gateway(address="http://
```

I presume that since no password is configured I can omit the auth parameter.
Yeah I would expect this to work.
Could you also share the pod logs for the gateway server?
@jacobtomlinson just to clarify: if we are accessing the gateway within the cluster, there is no need for the Traefik components to be deployed, right? As I mentioned in the issue description, I removed the YAML files in the template/traefik folder because the cluster already had a Traefik deployment.
@aravindp ah I missed that! So you've modified the chart? I think you will need the Traefik components here, it is being used here to do some specific proxying of the scheduler.
@jacobtomlinson yes, I modified it since we already have a Traefik ingress configured in our cluster. In that case, how should I do it? From the documentation it was not clear to me how to integrate it with an already existing Traefik ingress.
Dask gateway does not use traefik as an ingress, just as a service to proxy traffic.
Configure it the same way you would any other service.
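If you do want to route to it through an existing cluster Traefik, one way is an IngressRoute pointing at the proxy Service the chart creates. This is only a hypothetical sketch: the Service name, namespace, and entrypoint below are assumptions that depend on your release name and your Traefik setup, so adjust them to match your cluster.

```yaml
# Hypothetical sketch (Traefik v2 IngressRoute CRD). The service name
# "traefik-dask-gateway" and the "dask-gateway" namespace are assumptions;
# check `kubectl get svc -n <namespace>` for the actual Service name.
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: dask-gateway
  namespace: dask-gateway
spec:
  entryPoints:
    - web
  routes:
    - match: PathPrefix(`/`)
      kind: Rule
      services:
        - name: traefik-dask-gateway  # proxy Service created by the chart
          port: 80
```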
@jacobtomlinson I added back the below files from the traefik templates folder.
- template/traefik/dashboard.yaml
- template/traefik/service.yaml
I believe the deployment and RBAC YAMLs are not required as the cluster already has a Traefik ingress.
After the new deployment, the Traefik load balancer service is also available.
But when I try to connect to the URL using the external IP of the service, I still get a 403 error.
When I try curl, the error has some more details.
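For anyone debugging a similar 403, the extra detail curl shows is just the response body, which Python's stdlib also exposes on an `HTTPError`. Below is a minimal, self-contained sketch of that pattern; the local stand-in server and the `/api/health` path are illustrative assumptions, not part of Dask Gateway, so substitute your gateway's actual address.

```python
# Sketch: read the body of an error response (the same detail `curl -v` shows).
# The Forbidden handler below is a local stand-in for a proxy returning 403.
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Forbidden(BaseHTTPRequestHandler):
    """Stand-in server that rejects every GET with a 403 and a JSON body."""

    def do_GET(self):
        body = b'{"error": "Forbidden"}'
        self.send_response(403)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence request logging
        pass


def fetch_error_detail(url: str) -> tuple[int, str]:
    """Return (status, body) even when the server responds with an HTTP error."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # HTTPError carries the response body, like the output curl shows.
        return err.code, err.read().decode()


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), Forbidden)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    status, body = fetch_error_detail(
        f"http://127.0.0.1:{server.server_port}/api/health"  # illustrative path
    )
    print(status, body)  # 403 {"error": "Forbidden"}
    server.shutdown()
```

Pointing `fetch_error_detail` at the gateway's ClusterIP or external IP prints the status and the body of the 403, which often names the component that rejected the request.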
@aravindrp please do not modify the YAML in the chart. It makes it much harder for us to test and support it.
If you need to disable things please do so in the config, and if it's not possible to disable in config then please raise an issue so we can get that fixed.
Please could you try installing the vanilla chart without modifications and let us know how you get on.
@jacobtomlinson sorry for the delay in responding. I used this shortcut of manually modifying the Helm charts because I was still evaluating Dask Gateway. I am planning to do it the proper way for the final implementation.
I did some further analysis and it looks like a problem with the aiohttp package used within dask-gateway, because when I try curl, or use the urllib3 package to connect to the API server, it works without any issues. I have raised a question on Stack Overflow about this. Meanwhile, have you seen any similar behavior before?
https://stackoverflow.com/questions/67115594/forbidden-error-while-trying-to-access-a-url-using-aiohttp
@aravindrp thanks for summarizing that this may be related to aiohttp. It may be relevant to know it is being updated in #423, but we also need to get a release out after that is merged.
As we have not arrived at a clear action point regarding the code in this repo, I suggest we close this issue at this point.