Ensure workers are recreated if pods are deleted/lost
Closes #603
Added a timer to the `daskworkergroup_update` handler so that the scaling logic runs every 5 seconds as well as whenever the group is updated. This function checks whether the current number of worker pods matches the expected count and creates/removes pods accordingly.
If a node is terminated and pods are lost this timer should notice the mismatch and correct it.
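To illustrate the idea, here is a minimal, self-contained sketch of the check the timer performs. The names (`scale_delta`, the pod dicts) are hypothetical for illustration, not the actual operator code; the kopf wiring shown in comments is likewise a rough approximation.

```python
# Hypothetical sketch of the scaling check the timer runs: compare the
# observed worker pods against the desired replica count and return how
# many pods to create (positive) or remove (negative).

def scale_delta(desired_replicas: int, worker_pods: list) -> int:
    """Return pods to create (>0) or delete (<0) to reach the desired count."""
    live = [p for p in worker_pods if p.get("phase") in ("Pending", "Running")]
    return desired_replicas - len(live)

# In the operator this would be wired into a kopf timer, roughly:
#
#   @kopf.timer("kubernetes.dask.org", "v1", "daskworkergroups", interval=5.0)
#   async def daskworkergroup_update(spec, name, namespace, **kwargs):
#       ...list the group's pods, then create/delete workers per scale_delta...

if __name__ == "__main__":
    pods = [{"name": "w-0", "phase": "Running"}, {"name": "w-1", "phase": "Failed"}]
    print(scale_delta(3, pods))  # 2 -> two replacement workers needed
```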
Also added a test to check for this.
This does leave open the question of what about other resources like the scheduler pod or job runner pod. Should those also be recreated automatically, or should the parent resource go into an error state?
> Should those also be recreated automatically, or should the parent resource go into an error state?
IMO an error state makes more sense for these. I would expect a job to possibly have setup/teardown logic that isn't idempotent, so automatically restarting it seems potentially unsafe; an error state would be safer. Or perhaps a configurable maximum number of retries (defaulting to 0)?
To keep the load on the API server lower (especially with many worker pods), we could use an index to track the live workers. I'd be happy to contribute a patch for this.
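For anyone unfamiliar with the pattern, here is an illustrative, stdlib-only sketch (not the kopf API itself) of why an index helps: a watch-driven, in-memory map of live worker pods per worker group lets the timer read local state instead of issuing a LIST call to the API server every 5 seconds. The `WorkerIndex` class and event names are hypothetical.

```python
from collections import defaultdict

class WorkerIndex:
    """Tracks live worker pod names, keyed by worker-group name."""

    def __init__(self):
        self._by_group = defaultdict(set)

    def on_event(self, event_type: str, group: str, pod_name: str) -> None:
        # Fed from pod watch events (ADDED/MODIFIED/DELETED), so the map
        # stays current without polling the API server.
        if event_type == "DELETED":
            self._by_group[group].discard(pod_name)
        else:
            self._by_group[group].add(pod_name)

    def count(self, group: str) -> int:
        return len(self._by_group[group])

# kopf supports this pattern natively via @kopf.index on the pod resource,
# passing the populated index into handlers as a keyword argument.

if __name__ == "__main__":
    idx = WorkerIndex()
    idx.on_event("ADDED", "default", "w-0")
    idx.on_event("ADDED", "default", "w-1")
    idx.on_event("DELETED", "default", "w-1")
    print(idx.count("default"))  # 1
```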
> IMO an error state makes more sense for these.
Appreciate the input on that. I opened #606 to track this conversation further.
> I'd be happy to contribute a patch for this.
Nice! I've not used kopf indexes before. Yeah if you want to submit a PR that supersedes this one that would be awesome!
Hi @jacobtomlinson, is this still in the works, or is another solution planned? We see the same issue mentioned in #603 with our spot instances. I noticed @kwohlfahrt suggested a different fix, but I believe that still needs to be implemented. Could your solution be pushed to a release in the meantime?
@tasansal I expect this PR will be superseded by one that @Matt711 is working on this week.
Superseded