dask-kubernetes
Runner pod for DaskJob fails to spawn
Describe the issue:
The runner pod for a DaskJob fails to spawn when the DaskJob is deleted and then quickly re-created.
Minimal Complete Verifiable Example:
- Create a DaskJob using the example yaml from the Dask documentation.
kubectl apply -f daskjob.yaml
- Wait for the runner pod to start.
$ kubectl get all
NAME                                                             READY   STATUS    RESTARTS   AGE
pod/test-simple-job-default-worker-8911716d53-7f8dc4897-tlqm2    1/1     Running   0          5s
pod/test-simple-job-default-worker-ae18a247f6-64d8f6d6d7-xlf4m   1/1     Running   0          5s
pod/test-simple-job-runner                                       1/1     Running   0          6s
pod/test-simple-job-scheduler-7bc7cfb9b7-jlbb6                   0/1     Running   0          5s
- Delete the DaskJob.
kubectl delete -f daskjob.yaml
- Quickly re-create the DaskJob.
kubectl apply -f daskjob.yaml
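For anyone who wants to script the steps above, here is a minimal sketch in Python. It assumes `kubectl` is on the PATH and pointed at a test cluster, that the manifest is saved as `daskjob.yaml`, and that the runner pod is named `test-simple-job-runner` as in the output above. The `kubectl` and `reproduce` helpers are hypothetical names made up for this sketch, not part of dask-kubernetes.

```python
import subprocess
import time

MANIFEST = "daskjob.yaml"  # manifest used in the steps above
RUNNER_POD = "test-simple-job-runner"  # runner pod name taken from the output above


def kubectl(*args):
    """Run kubectl with the given arguments and return its stdout."""
    result = subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def reproduce(run=kubectl, timeout=60.0, poll=1.0):
    """Create the DaskJob, wait for the runner pod, then delete and re-create it."""
    run("apply", "-f", MANIFEST)

    # Poll until the runner pod is listed among the pods in the Running phase.
    deadline = time.monotonic() + timeout
    while True:
        running = run(
            "get", "pods", "--field-selector=status.phase=Running", "-o", "name"
        ).split()
        if f"pod/{RUNNER_POD}" in running:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"{RUNNER_POD} did not reach Running in time")
        time.sleep(poll)

    # Delete and immediately re-create, as in the last two steps.
    run("delete", "-f", MANIFEST)
    run("apply", "-f", MANIFEST)

    # Return the pod names present right after re-creation for inspection.
    return run("get", "pods", "-o", "name")
```

Calling `reproduce()` and printing the result shows which pods exist right after the re-create. The `run` parameter is only there so the sequence can be exercised without a cluster.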
Anything else we need to know?:
This doesn't affect the scheduler or worker pods because they have a unique suffix appended to their names. The runner pod does not: its name is the same every time, so a re-created DaskJob presumably collides with the previous runner pod while that pod is still being removed. See this code that generates the runner pod's name: https://github.com/dask/dask-kubernetes/blob/c7839098e1c88f99d8110477981f9e7f3e6f49cc/dask_kubernetes/operator/controller/controller.py#L171
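One possible direction, sketched below, is to give the runner pod a unique suffix as well. This is only an illustration and not the code at the link above: `runner_pod_name` is a hypothetical helper, and the ten-character hex suffix is an arbitrary choice loosely modelled on the worker pod names shown earlier.

```python
import uuid


def runner_pod_name(base_name: str, suffix_length: int = 10) -> str:
    """Return base_name with a random hex suffix (hypothetical helper).

    base_name is the fixed name the runner pod gets today, for example
    "test-simple-job-runner" in the output above.
    """
    # uuid4().hex is 32 lowercase hex characters, which are valid in pod names.
    suffix = uuid.uuid4().hex[:suffix_length]
    return f"{base_name}-{suffix}"


# Two consecutive calls produce different names for the same job.
first = runner_pod_name("test-simple-job-runner")
second = runner_pod_name("test-simple-job-runner")
print(first)
print(second)
```

With distinct names, a re-created job would no longer depend on the old runner pod being gone. The trade-off is that anything that finds the runner pod by its fixed name would have to select it some other way, for example by label.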
Environment:
- Dask operator version: 2023.9.0