KubeCluster looking for scheduler at wrong port when using NodePort service

Open radioflyer28 opened this issue 1 year ago • 2 comments

What happened: KubeCluster times out when creating a cluster with a NodePort scheduler service, because it tries to reach the scheduler on port 8786 while the service exposes it on a randomly assigned node port (e.g. 32367). Note that the scheduler pod itself starts correctly.

Python test client error:

OSError: Timed out during handshake while connecting to tcp://10.20.0.241:8786 after 10 s

NodePort service in Rancher:
[image: screenshot of the NodePort service in Rancher]

What you expected to happen:
For KubeCluster to use the randomized port to contact the scheduler pod instead of port 8786.
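
To make the expected behaviour concrete, here is a minimal sketch of the port selection I have in mind. It is not the dask-kubernetes implementation: `scheduler_port` is a hypothetical helper, the `service` dict is a hand-written stand-in for the JSON form of the scheduler Service (as `kubectl get svc -o json` would show it), and the port name `tcp-comm` is an assumption about how the scheduler port is labelled.

```python
from typing import Optional

DEFAULT_SCHEDULER_PORT = 8786  # port the client currently tries to reach


def scheduler_port(service: dict, port_name: Optional[str] = None) -> int:
    """Hypothetical helper: pick the port an external client should dial.

    `service` is a Kubernetes Service in dict/JSON form. For a NodePort
    service the assigned `nodePort` is returned; otherwise the service's
    own `port` is used, falling back to 8786 if no port entry is found.
    """
    spec = service.get("spec", {})
    ports = spec.get("ports", [])
    if port_name is not None:
        # Keep only the entry with the requested name (assumed "tcp-comm" below)
        ports = [p for p in ports if p.get("name") == port_name]
    if not ports:
        return DEFAULT_SCHEDULER_PORT
    entry = ports[0]
    if spec.get("type") == "NodePort" and entry.get("nodePort"):
        return entry["nodePort"]
    return entry.get("port", DEFAULT_SCHEDULER_PORT)


# Hand-written example mirroring the situation described in this issue
service = {
    "spec": {
        "type": "NodePort",
        "ports": [
            {"name": "tcp-comm", "port": 8786, "targetPort": 8786, "nodePort": 32367},
        ],
    }
}
print(scheduler_port(service, "tcp-comm"))
```

With a NodePort service the helper returns the node port rather than 8786; a ClusterIP service would still yield its regular `port`.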

Minimal Complete Verifiable Example:

test_kubecluster.py:

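import dask
from dask_kubernetes import KubeCluster, KubeConfig

# Authenticate against the remote cluster with a specific kubeconfig file
auth = KubeConfig(config_file="~/.kube/remote")
# Expose the scheduler through a NodePort service
dask.config.set({"kubernetes.scheduler-service-type": "NodePort"})

# Run the scheduler in a pod on the cluster; this is the call that times out
cluster = KubeCluster('worker-spec.yml', auth=auth, deploy_mode='remote')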

worker-spec.yml:

# worker-spec.yml

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: ghcr.io/dask/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, $(DASK_SCHEDULER_ADDRESS), --nthreads, '2', --no-dashboard, --memory-limit, 4GB, --death-timeout, '60']
    name: dask-worker
    env:
      - name: EXTRA_PIP_PACKAGES
        value: git+https://github.com/dask/distributed
    resources:
      limits:
        cpu: "2"
        memory: 4G
      requests:
        cpu: "2"
        memory: 4G

Anything else we need to know?:

Environment:

  • Dask version: 2021.3.0=pyhd8ed1ab_0
  • Dask core: 2021.3.0=pyhd8ed1ab_0
  • Dask kubernetes: 2022.7.0=pyhd8ed1ab_0
  • Python version: 3.8.8=h7840368_0_cpython
  • Operating System: Windows 10
  • Install method (conda, pip, source): conda-forge
Cluster Dump State:

radioflyer28, Aug 25 '22 19:08

Parent forum post with @jacobtomlinson
https://dask.discourse.group/t/kubecluster-provisions-pod-but-times-out-before-returning-cluster-object/1049

radioflyer28, Aug 25 '22 19:08

I think this line is the culprit:

https://github.com/dask/dask-kubernetes/blob/8bd508394ba1981186115c18da3dfbdce536226b/dask_kubernetes/common/networking.py#L34

jacobtomlinson, Aug 26 '22 08:08