Retire pending workers
Based on discussion at https://github.com/dask/dask-kubernetes/issues/817
Everything looks good at least for now. I deployed the change to our production cluster and will provide updates if needed. Hopefully, everything works well 😄
Surprisingly, I had the most stable run ever. One thing to note: if a pod is restarting, its deployment is in an unready state, so there is a small chance of the following race:
- The operator gets the unready state
- The pod restarts fast enough and starts executing work
- The operator kills the pod mid-run
So I was thinking of a way to run the logic only on deployments that have pods in a pending state ~~however I can't find a way to do that using kr8s~~
Done! Now the operator acts on pending pods rather than on deployments.
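For anyone following along, here is a minimal sketch of the idea, not the actual code in this PR. It assumes pods are available as plain dicts shaped like Kubernetes pod objects (`metadata.name`, `status.phase`); the helper name `pending_worker_names` and the sample pod names are made up for illustration. Filtering on the pod phase means a pod that has restarted and is already running work is left alone, which avoids the race described above.

```python
from typing import Any, Dict, List


def pending_worker_names(pods: List[Dict[str, Any]]) -> List[str]:
    """Return the names of pods whose phase is exactly "Pending".

    Hypothetical helper: pods are plain dicts mirroring the Kubernetes
    pod schema. Pods with a missing status or any other phase are skipped.
    """
    return [
        pod["metadata"]["name"]
        for pod in pods
        if pod.get("status", {}).get("phase") == "Pending"
    ]


# Illustrative data only; these pod names are not from the PR.
pods = [
    {"metadata": {"name": "worker-a"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "worker-b"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "worker-c"}, "status": {"phase": "Pending"}},
]

# Only worker-a and worker-c would be candidates for retirement.
print(pending_worker_names(pods))
```

Only the pods returned here would be considered for retirement; a deployment that is unready because its pod is restarting would not match unless the pod itself is actually pending.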
Sorry for the long delay. Unfortunately, I don't have much time to work on this. In the meantime, I'll close this PR so that someone else can finish it.
@jacobtomlinson I believe this is ready to merge
It is something related to building some Go code. I tried to find out what is wrong but couldn't figure it out. It appears to be related to the CI itself, as I see most of the PRs are failing too.
@jacobtomlinson If possible, could you suggest a way to solve this issue?
Thanks for being patient here. I've nudged the CI back into a happy state and pulled main into this PR. So hopefully now any failures will only be related to this PR.
Ok the tests are still failing here so they are certainly hanging due to the changes in this PR.
These seem to be timeout issues rather than actual test failures.
Sure but I expect the CI is timing out because the tests are hanging. The tests on main do not hang, which suggests the changes here cause the tests to hang.