[batch] reproduce Ben's non-responsive worker issue and convert to a test

Open danking opened this issue 7 months ago • 0 comments

What happened?

Ben W reports that he can reliably cause a Batch worker VM to become non-responsive, which causes the driver to kill the VM and the job to be rescheduled.

https://hail.zulipchat.com/#narrow/stream/300487-Hail-Batch-Dev/topic/workers.20which.20suddenly.20stop.20responding/near/400852561

This ticket is complete when:

  1. We have reproduced the behavior Ben reported on a commit on main at or before 06183480d2.
  2. We have reduced Ben's test case to something we can add as a test.

Version

0.2.126

Relevant log output

No response

danking Nov 08 '23 21:11