
Runner pod for DaskJob fails to spawn

Open • creste opened this issue 1 year ago • 1 comment

Describe the issue:

The runner pod for a DaskJob fails to spawn when the DaskJob is deleted and then quickly re-created.

Minimal Complete Verifiable Example:

  1. Create a DaskJob using the example YAML from the Dask documentation.
     $ kubectl apply -f daskjob.yaml
  2. Wait for the runner pod to start.
     $ kubectl get all
     NAME                                                             READY   STATUS      RESTARTS   AGE
     pod/test-simple-job-default-worker-8911716d53-7f8dc4897-tlqm2    1/1     Running     0          5s
     pod/test-simple-job-default-worker-ae18a247f6-64d8f6d6d7-xlf4m   1/1     Running     0          5s
     pod/test-simple-job-runner                                       1/1     Running     0          6s
     pod/test-simple-job-scheduler-7bc7cfb9b7-jlbb6                   0/1     Running     0          5s
  3. Delete the DaskJob.
     $ kubectl delete -f daskjob.yaml
  4. Quickly re-create the DaskJob.
     $ kubectl apply -f daskjob.yaml
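
The steps above could be scripted roughly as follows. This is only a sketch: the `kubectl` helper, the `reproduce` function, and its parameters are hypothetical names introduced here, and the pod name `test-simple-job-runner` is taken from the listing above. It assumes `kubectl` is on the PATH and already pointed at a cluster running the Dask operator.

```python
import subprocess
import time
from typing import Callable


def kubectl(*args: str) -> str:
    """Run kubectl and return its stdout, or an empty string if it fails."""
    proc = subprocess.run(["kubectl", *args], capture_output=True, text=True)
    return proc.stdout if proc.returncode == 0 else ""


def reproduce(
    manifest: str = "daskjob.yaml",
    runner_pod: str = "test-simple-job-runner",  # name from the listing above
    run: Callable[..., str] = kubectl,
    poll_seconds: float = 1.0,
    max_polls: int = 120,
) -> str:
    """Create the DaskJob, wait for the runner, then delete and re-create it."""
    # Step 1: create the DaskJob.
    run("apply", "-f", manifest)

    # Step 2: poll until the runner pod reports the Running phase.
    for _ in range(max_polls):
        phase = run("get", "pod", runner_pod, "-o", "jsonpath={.status.phase}")
        if phase.strip() == "Running":
            break
        time.sleep(poll_seconds)

    # Steps 3 and 4: delete, then immediately re-create.
    run("delete", "-f", manifest)
    run("apply", "-f", manifest)

    # Return the pod listing so the caller can check for the runner pod.
    return run("get", "pods")
```

Calling `print(reproduce())` and looking for `test-simple-job-runner` in the output is one way to check whether the runner pod came back after the re-create.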

Anything else we need to know?:

This doesn't affect the scheduler or worker pods because they have a unique suffix appended to their names. The runner pod does not, so its name is the same every time the DaskJob is created. See this code that generates the runner pod's name: https://github.com/dask/dask-kubernetes/blob/c7839098e1c88f99d8110477981f9e7f3e6f49cc/dask_kubernetes/operator/controller/controller.py#L171
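
To illustrate the naming difference, here is a minimal sketch. It is not the operator's code: both functions are hypothetical, and it assumes the runner name is derived only from the job name, as the `test-simple-job-runner` pod in the listing above suggests. The random suffix in the second function is one possible scheme, not necessarily the one the operator uses for scheduler and worker pods.

```python
import uuid


def fixed_runner_name(job_name: str) -> str:
    """Hypothetical: a runner pod name derived from the job name alone."""
    return f"{job_name}-runner"


def suffixed_runner_name(job_name: str) -> str:
    """Hypothetical: the same name with a random suffix appended."""
    return f"{job_name}-runner-{uuid.uuid4().hex[:10]}"


# Pretend the pod from the first DaskJob still exists while it terminates.
existing_pods = {fixed_runner_name("test-simple-job")}

# Re-creating the job asks for the same name again, so it collides.
fixed_collides = fixed_runner_name("test-simple-job") in existing_pods

# A suffixed name is different on every call, so it does not collide.
suffixed_collides = suffixed_runner_name("test-simple-job") in existing_pods
```

Under these assumptions `fixed_collides` is true and `suffixed_collides` is false, which matches the observation that only the runner pod is affected.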

Environment:

  • Dask operator version: 2023.9.0

creste, Oct 05 '23 20:10