dask-kubernetes

Retire pending workers

Open BitTheByte opened this issue 2 years ago • 11 comments

Based on discussion at https://github.com/dask/dask-kubernetes/issues/817

BitTheByte avatar Sep 18 '23 15:09 BitTheByte

Everything looks good at least for now. I deployed the change to our production cluster and will provide updates if needed. Hopefully, everything works well 😄

BitTheByte avatar Sep 18 '23 17:09 BitTheByte

Surprisingly, I had the most stable run ever. One thing to note: if a pod is restarting, which means its deployment is in an unready state, there is a small chance of the following race:

  1. The operator gets the unready state
  2. pod restarts fast enough and starts executing work
  3. The operator kills the pod mid-run

So I was thinking of a way to execute the logic only on deployments that have pods in a pending state ~~however I can't find a way to do that using kr8s~~

BitTheByte avatar Sep 18 '23 18:09 BitTheByte

Done! Now the operator takes action based on pending pods rather than on the deployments.
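A minimal sketch of the idea, not the code from this PR: select pods by their own `status.phase` instead of by the readiness of the owning deployment. The helper name `pending_pod_names` and the sample pod names are hypothetical, and the pods are plain dicts shaped like Kubernetes Pod manifests so the example runs without a cluster.

```python
from typing import Any, Iterable


def pending_pod_names(pods: Iterable[dict[str, Any]]) -> list[str]:
    """Return the names of pods whose phase is exactly "Pending".

    Hypothetical helper for illustration only. Pods with any other
    phase, or with no phase reported, are left alone.
    """
    names = []
    for pod in pods:
        phase = pod.get("status", {}).get("phase")
        if phase == "Pending":
            names.append(pod["metadata"]["name"])
    return names


# Sample data (made up): one running worker, one pending, one with no status yet.
pods = [
    {"metadata": {"name": "worker-a"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "worker-b"}, "status": {"phase": "Pending"}},
    {"metadata": {"name": "worker-c"}, "status": {}},
]

print(pending_pod_names(pods))  # ['worker-b']
```

Filtering on the pod phase should sidestep the race described above, assuming a worker that has already restarted and picked up work reports `Running` rather than `Pending`.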

BitTheByte avatar Sep 18 '23 20:09 BitTheByte

Sorry for the long delay. Unfortunately, I don't have much time to work on this. In the meantime, I'll close this PR to allow someone else to finish it.

BitTheByte avatar Oct 22 '23 04:10 BitTheByte

@jacobtomlinson I believe this is ready to merge

BitTheByte avatar Jan 08 '24 21:01 BitTheByte

It is something related to building some Go code. I tried to check what is wrong but couldn't figure it out. It appears to be related to the CI itself, as I see most of the PRs are failing too.

BitTheByte avatar Jan 13 '24 13:01 BitTheByte

@jacobtomlinson If possible, can you suggest a solution to this issue?

BitTheByte avatar Feb 18 '24 10:02 BitTheByte

Thanks for being patient here. I've nudged the CI back into a happy state and pulled main into this PR. So hopefully now any failures will only be related to this PR.

jacobtomlinson avatar Feb 20 '24 14:02 jacobtomlinson

Ok the tests are still failing here so they are certainly hanging due to the changes in this PR.

jacobtomlinson avatar Feb 20 '24 15:02 jacobtomlinson

Seems like they're timeout issues, not something actually failing.

BitTheByte avatar Feb 20 '24 16:02 BitTheByte

Sure but I expect the CI is timing out because the tests are hanging. The tests on main do not hang, which suggests the changes here cause the tests to hang.

jacobtomlinson avatar Feb 20 '24 17:02 jacobtomlinson