Dask Auto scaler failing to create

Open LuanAraldi opened this issue 1 year ago • 8 comments

I'm trying to set up a simple DaskAutoscaler on Kubernetes using YAML files, but the autoscaler fails to be created with the following error:

  Error  Logging  45s  kopf  Timer 'daskautoscaler_adapt' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 850, in daskautoscaler_adapt
    scheduler = await Pod.get(
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 186, in get
    raise NotFoundError(f"Could not find {cls.kind} {name}.")
kr8s._exceptions.NotFoundError: Could not find Pod None.
  Error  Logging  2s  kopf  Handler 'daskautoscaler_create' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 841, in daskautoscaler_create
    autoscaler = await DaskAutoscaler(body)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 45, in __init__
    raise ValueError("resource must be a dict or a string")
ValueError: resource must be a dict or a string
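
One possible reading of the second traceback, not confirmed against the kr8s or kopf source: the `body` handed to the handler may be a read-only mapping view rather than a plain `dict`, so a strict type check rejects it. The sketch below reproduces that pattern with stand-ins. `BodyView` and `strict_init` are hypothetical names written for illustration; they are not the real kopf or kr8s classes.

```python
from collections.abc import Mapping


class BodyView(Mapping):
    """Hypothetical read-only mapping, standing in for a handler's body argument."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def strict_init(resource):
    """Hypothetical constructor check that accepts only dict or str."""
    if not isinstance(resource, (dict, str)):
        raise ValueError("resource must be a dict or a string")
    return resource


body = BodyView({"kind": "DaskAutoscaler", "metadata": {"name": "autoscaled"}})

# A Mapping behaves like a dict for reads, but it is not a dict instance.
print(isinstance(body, Mapping), isinstance(body, dict))

# Converting to a plain dict first satisfies the strict check.
resource = strict_init(dict(body))
print(resource["kind"])
```

If this is what happens, the manifest itself would not be the cause, and the fix would belong in the operator or its dependencies rather than in the YAML.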

The autoscaler.yaml file that I am using is this one:

apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  namespace: dask
  name: autoscaled
spec:
  cluster: autoscaled
  minimum: 1 
  maximum: 5 

The cluster YAML definition is as follows:

apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: autoscaled
  namespace: dask
spec:
  worker:
    replicas: 0
    spec:
      serviceAccountName: dask-operator-sa
      tolerations:  
      - key: dedicated
        operator: Equal
        value: dask-worker
      nodeSelector: 
        dedicated: dask-worker
      containers:
      - name: worker
        image: "ghcr.io/dask/dask:latest"
        imagePullPolicy: "IfNotPresent"
        args:
          - dask-worker
          - --name
          - $(DASK_WORKER_NAME)
          - --dashboard
          - --dashboard-address
          - "8788"
        ports:
          - name: http-dashboard
            containerPort: 8788
            protocol: TCP
        env:
          - name: EXTRA_PIP_PACKAGES
            value: pyarrow s3fs
        resources:
          limits:
            cpu: "2"
            memory: "18G"
          requests:
            cpu: "1"
            memory: "16G"
        
  scheduler:
    spec:
      containers:
      - name: scheduler
        image: "ghcr.io/dask/dask:latest"
        imagePullPolicy: "IfNotPresent"
        args:
          - dask-scheduler
        ports:
          - name: tcp-comm
            containerPort: 8786
            protocol: TCP
          - name: http-dashboard
            containerPort: 8787
            protocol: TCP
        readinessProbe:
          httpGet:
            port: http-dashboard
            path: /health
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            port: http-dashboard
            path: /health
          initialDelaySeconds: 15
          periodSeconds: 20
        resources:
          limits:
            cpu: "1"
            memory: "3G"
          requests:
            cpu: "1"
            memory: "2G"
        env:
          - name: EXTRA_PIP_PACKAGES
            value: pyarrow s3fs
          - name: DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION
            value: "1.0"
    service:
      type: NodePort
      selector:
        dask.org/cluster-name: autoscaled
        dask.org/component: scheduler
      ports:
      - name: tcp-comm
        protocol: TCP
        port: 8786
        targetPort: "tcp-comm"
      - name: http-dashboard
        protocol: TCP
        port: 8787
        targetPort: "http-dashboard"
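
To rule out a simple mismatch between the two manifests, the relevant fields can be cross-checked. This is a minimal sketch, assuming the manifests have already been loaded into plain dicts (only the fields being compared are reproduced here). The helper `check_autoscaler_targets_cluster` is a hypothetical name and not part of dask-kubernetes.

```python
def check_autoscaler_targets_cluster(autoscaler: dict, cluster: dict) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []

    target = autoscaler.get("spec", {}).get("cluster")
    cluster_name = cluster.get("metadata", {}).get("name")
    if target != cluster_name:
        problems.append(
            f"spec.cluster is {target!r} but the DaskCluster is named {cluster_name!r}"
        )

    autoscaler_ns = autoscaler.get("metadata", {}).get("namespace")
    cluster_ns = cluster.get("metadata", {}).get("namespace")
    if autoscaler_ns != cluster_ns:
        problems.append(
            f"namespaces differ: {autoscaler_ns!r} vs {cluster_ns!r}"
        )

    minimum = autoscaler.get("spec", {}).get("minimum")
    maximum = autoscaler.get("spec", {}).get("maximum")
    if minimum is None or maximum is None or minimum > maximum:
        problems.append(f"invalid bounds: minimum={minimum!r}, maximum={maximum!r}")

    return problems


# Fields copied from the two manifests in this issue.
autoscaler = {
    "apiVersion": "kubernetes.dask.org/v1",
    "kind": "DaskAutoscaler",
    "metadata": {"namespace": "dask", "name": "autoscaled"},
    "spec": {"cluster": "autoscaled", "minimum": 1, "maximum": 5},
}
cluster = {
    "apiVersion": "kubernetes.dask.org/v1",
    "kind": "DaskCluster",
    "metadata": {"name": "autoscaled", "namespace": "dask"},
}

print(check_autoscaler_targets_cluster(autoscaler, cluster))
```

For the manifests as posted, the names, namespaces, and bounds agree, so the check reports no problems.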

Environment:

  • Dask version: 2023.7.0
  • Python version: 3.10.9

LuanAraldi · Jul 20 '23 09:07