
Cleanup pending pods on scale down

Open BitTheByte opened this issue 2 years ago • 2 comments

Currently, the operator retires workers using the HTTP or RPC APIs. However, those APIs only control Dask workers that are already connected. The operator should also take into account Dask worker pods that are still in a Pending state: those pods cause a needless Kubernetes cluster scale-up, then connect to Dask and get retired anyway. A scale-down should therefore retire active workers and also prevent pending pods from ever reaching the Running state.

BitTheByte avatar Sep 16 '23 14:09 BitTheByte

Agreed.

We could add a check here for any Pods that aren't in a Running phase and delete those before calling retire_workers (if that's even necessary any more).

https://github.com/dask/dask-kubernetes/blob/92714da5785709726f85c4c6ec92451f5c23ad04/dask_kubernetes/operator/controller/controller.py#L599-L600
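A rough sketch of what that check could look like, assuming the worker pods are available as plain dicts shaped like Kubernetes Pod objects (with a `status.phase` field). The helper name `split_by_phase` is hypothetical and is not part of dask-kubernetes:

```python
def split_by_phase(pods):
    """Split pods into (running, not_running) based on status.phase.

    `pods` is assumed to be a list of dicts shaped like Kubernetes Pod
    objects. Anything whose phase is not "Running" (Pending, Failed,
    a missing phase, ...) lands in the second list.
    """
    running, not_running = [], []
    for pod in pods:
        phase = pod.get("status", {}).get("phase")
        if phase == "Running":
            running.append(pod)
        else:
            not_running.append(pod)
    return running, not_running


# Hypothetical example data, not taken from a real cluster.
pods = [
    {"metadata": {"name": "worker-a"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "worker-b"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "worker-c"}, "status": {"phase": "Running"}},
]
running, not_running = split_by_phase(pods)
not_running_names = [p["metadata"]["name"] for p in not_running]
```

The pods in `not_running` would be the candidates to delete before any connected workers are retired.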

jacobtomlinson avatar Sep 18 '23 13:09 jacobtomlinson

Looks good to me. We should also subtract the pending workers from the number of workers passed to retire_workers.
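To make the arithmetic concrete, here is a small sketch. It assumes that at most as many pending pods are deleted as the scale-down actually requires; the function name `plan_scale_down` and its parameters are hypothetical and do not come from the operator code:

```python
def plan_scale_down(total_pods, pending_pods, desired):
    """Return (pending_to_delete, workers_to_retire) for a scale-down.

    total_pods   -- all worker pods, pending ones included
    pending_pods -- worker pods that are not yet running
    desired      -- target number of workers after scaling
    """
    # How many pods have to go in total (never negative).
    excess = max(total_pods - desired, 0)
    # Pending pods are removed first, but no more than needed.
    pending_to_delete = min(pending_pods, excess)
    # Only the remainder has to be retired through the scheduler.
    workers_to_retire = excess - pending_to_delete
    return pending_to_delete, workers_to_retire
```

For example, with 10 worker pods of which 3 are pending and a target of 6, this gives 3 pending pods to delete and 1 connected worker to retire, rather than retiring 4.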

BitTheByte avatar Sep 18 '23 13:09 BitTheByte