Retire pending workers
Based on discussion at https://github.com/dask/dask-kubernetes/issues/817
I believe https://github.com/dask/dask-kubernetes/pull/877 implements the same logic as this PR.
Have you been running this in a real world setting?
I've been running this PR on a 300-worker cluster for a while and it has worked without issues.
Do you have any thoughts about how we could test this?
Unfortunately no, since I'm not familiar with Dask's test suites. Generally, though, a test should launch n unschedulable workers and wait for the controller to scale them down to the correct number.
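As a rough sketch of what such a test could assert, here is a pure-Python version that needs no cluster. `Worker` and `pick_workers_to_remove` are hypothetical names made up for illustration, not dask-kubernetes APIs, and the selection rule (drop pending pods before running ones) is an assumption about the intended behaviour rather than the controller's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    # Hypothetical stand-in for a worker pod; not a dask-kubernetes type.
    name: str
    phase: str  # Kubernetes pod phase, e.g. "Pending" or "Running"


def pick_workers_to_remove(workers, desired):
    """Return the workers to drop so that `desired` remain, pending ones first."""
    excess = max(len(workers) - desired, 0)
    # sorted() is stable: pending pods move to the front and the
    # original order is otherwise preserved.
    ordered = sorted(workers, key=lambda w: w.phase != "Pending")
    return ordered[:excess]


# Five workers, three of which are stuck unschedulable; scale down to three.
workers = [
    Worker("w0", "Running"),
    Worker("w1", "Pending"),
    Worker("w2", "Running"),
    Worker("w3", "Pending"),
    Worker("w4", "Pending"),
]
removed = pick_workers_to_remove(workers, desired=3)
remaining = [w for w in workers if w not in removed]
```

A real test would replace the in-memory list with worker pods that cannot be scheduled and poll until the controller reaches the target count, but the assertions would have the same shape: the right number of workers remain, and running workers are not removed while pending ones are available.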
I just bumped this PR to use the latest code from main but something in here is still causing the CI to hang. It's not obvious to me what is causing it, but we will need to get to the bottom of it to be able to get this merged. @BitTheByte do you think you will have some time to try and push this over the line?