dask-kubernetes
KubeCluster looking for scheduler at wrong port when using NodePort service
What happened: KubeCluster times out when creating a cluster with a NodePort service because it looks for the scheduler on port 8786, while the service actually exposes it on a randomized node port (e.g. 32367). Note that the scheduler pod does start correctly.
Python test client error:
OSError: Timed out during handshake while connecting to tcp://10.20.0.241:8786 after 10 s
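To confirm that this is a port mismatch rather than a general networking problem, a small reachability check can be run from the client machine. This is only a diagnostic sketch: `port_is_open` is a hypothetical helper written for this report, not part of dask-kubernetes, and it uses nothing beyond the standard library.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake; closing the socket
        # immediately afterwards is enough for a reachability check.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Calling `port_is_open("10.20.0.241", 8786)` and `port_is_open("10.20.0.241", 32367)` with the address from the error above should show which of the two ports actually accepts connections from outside the cluster.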
NodePort service in Rancher:
What you expected to happen:
For KubeCluster to use the randomized port to contact the scheduler pod instead of port 8786.
Minimal Complete Verifiable Example:
test_kubecluster.py:
import dask
from dask_kubernetes import KubeCluster, KubeConfig
auth = KubeConfig(config_file="~/.kube/remote")
dask.config.set({"kubernetes.scheduler-service-type": "NodePort"})
cluster = KubeCluster('worker-spec.yml', auth=auth, deploy_mode='remote')
worker-spec.yml:
# worker-spec.yml
kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: ghcr.io/dask/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, $(DASK_SCHEDULER_ADDRESS), --nthreads, '2', --no-dashboard, --memory-limit, 4GB, --death-timeout, '60']
    name: dask-worker
    env:
      - name: EXTRA_PIP_PACKAGES
        value: git+https://github.com/dask/distributed
    resources:
      limits:
        cpu: "2"
        memory: 4G
      requests:
        cpu: "2"
        memory: 4G
Anything else we need to know?:
Environment:
- Dask version: 2021.3.0=pyhd8ed1ab_0
- Dask core: 2021.3.0=pyhd8ed1ab_0
- Dask kubernetes: 2022.7.0=pyhd8ed1ab_0
- Python version: 3.8.8=h7840368_0_cpython
- Operating System: Windows 10
- Install method (conda, pip, source): conda-forge
Cluster Dump State:
Parent forum post with @jacobtomlinson
https://dask.discourse.group/t/kubecluster-provisions-pod-but-times-out-before-returning-cluster-object/1049
I think this line is the culprit:
https://github.com/dask/dask-kubernetes/blob/8bd508394ba1981186115c18da3dfbdce536226b/dask_kubernetes/common/networking.py#L34
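For illustration, here is a rough sketch of the port selection I would expect, written against a plain dict shaped like a Kubernetes Service manifest. It is not the actual dask-kubernetes code: the `scheduler_port` helper is hypothetical, and the port name `tcp-comm` is an assumption about how the scheduler service names its comm port. Only the `spec.type`, `spec.ports[].port` and `spec.ports[].nodePort` fields come from the Kubernetes Service schema.

```python
from typing import Any, Mapping

DEFAULT_SCHEDULER_PORT = 8786  # port the scheduler listens on inside the pod


def scheduler_port(service: Mapping[str, Any], port_name: str = "tcp-comm") -> int:
    """Pick the port an external client should dial for the scheduler.

    `service` is a dict shaped like a Kubernetes Service manifest.
    `port_name` is an assumed name for the scheduler comm port.
    """
    spec = service["spec"]
    for entry in spec.get("ports", []):
        if entry.get("name") != port_name:
            continue
        # For NodePort services the externally reachable port is nodePort,
        # not the in-cluster service port.
        if spec.get("type") == "NodePort" and entry.get("nodePort"):
            return int(entry["nodePort"])
        return int(entry.get("port", DEFAULT_SCHEDULER_PORT))
    # No matching port entry: fall back to the scheduler default.
    return DEFAULT_SCHEDULER_PORT
```

With a service of type NodePort whose comm port entry has `port: 8786` and `nodePort: 32367`, this returns 32367 instead of 8786. The node port would also need to be dialed on a node's address rather than the pod or cluster IP, which may be a second part of the problem here.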