dask-kubernetes Retire pending workers

Based on discussion at https://github.com/dask/dask-kubernetes/issues/817

Sep 18 '23 15:09 BitTheByte

Everything looks good at least for now. I deployed the change to our production cluster and will provide updates if needed. Hopefully, everything works well 😄

Sep 18 '23 17:09 BitTheByte

Surprisingly, I had the most stable run ever. One thing to note: if a pod is restarting, which means its deployment is in an unready state, there is a small chance of the following sequence:

1. The operator observes the unready state.
2. The pod restarts fast enough and starts executing work.
3. The operator kills the pod mid-run.

So I was thinking of a way to run the logic only on deployments that have pods in a pending state ~~however, I can't find a way to do that using kr8s~~

Sep 18 '23 18:09 BitTheByte

Done! Now the operator acts on pending pods rather than on the deployments.
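For readers following along, here is a rough sketch of what selecting by pod phase could look like. It is not the code in this PR and does not use the kr8s API: pods are modelled as plain dicts shaped like Kubernetes pod objects, and the helper names `is_pending` and `pending_worker_names` are hypothetical. It relies on the assumption that a pod whose container is restarting normally keeps the `Running` phase, so filtering on `Pending` would skip it.

```python
from typing import Any, Dict, List


def is_pending(pod: Dict[str, Any]) -> bool:
    """Return True when the pod reports phase "Pending" (hypothetical helper)."""
    status = pod.get("status") or {}
    return status.get("phase") == "Pending"


def pending_worker_names(pods: List[Dict[str, Any]]) -> List[str]:
    """Names of the pods that are still pending, in input order."""
    return [pod["metadata"]["name"] for pod in pods if is_pending(pod)]


# Illustrative data only; these pod names are made up.
pods = [
    {"metadata": {"name": "worker-a"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "worker-b"}, "status": {"phase": "Pending"}},
    # Assumed shape of a restarting pod: still "Running", so it is not selected.
    {"metadata": {"name": "worker-c"}, "status": {"phase": "Running"}},
]

print(pending_worker_names(pods))  # prints ['worker-b']
```

Keying the decision on the pod phase rather than on deployment readiness is what avoids the restart race described above, at least under the assumption stated in the lead-in.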

Sep 18 '23 20:09 BitTheByte

Sorry for the long delay. Unfortunately, I don't have much time to work on this. In the meantime, I'll close this PR to allow someone else to finish it.

Oct 22 '23 04:10 BitTheByte

@jacobtomlinson I believe this is ready to merge

Jan 08 '24 21:01 BitTheByte

It is something related to building some Go code. I tried to work out what is wrong but couldn't figure it out. It appears to be related to the CI itself, as I see most of the PRs are failing too.

Jan 13 '24 13:01 BitTheByte

@jacobtomlinson If possible, can you suggest a solution to this issue?

Feb 18 '24 10:02 BitTheByte

Thanks for being patient here. I've nudged the CI back into a happy state and pulled main into this PR. So hopefully now any failures will only be related to this PR.

Feb 20 '24 14:02 jacobtomlinson

Ok the tests are still failing here so they are certainly hanging due to the changes in this PR.

Feb 20 '24 15:02 jacobtomlinson

Seems like they're timeout issues, not something actually failing.

Feb 20 '24 16:02 BitTheByte

Sure but I expect the CI is timing out because the tests are hanging. The tests on main do not hang, which suggests the changes here cause the tests to hang.

Feb 20 '24 17:02 jacobtomlinson