hail
[batch] reproduce Ben's non-responsive worker issue and convert to a test
What happened?
Ben W reports that he can reliably cause a batch worker VM to become non-responsive, after which the driver kills the VM and the job is rescheduled.
https://hail.zulipchat.com/#narrow/stream/300487-Hail-Batch-Dev/topic/workers.20which.20suddenly.20stop.20responding/near/400852561
This ticket is complete when:
- We have reproduced the behavior Ben reports on a main commit at or before 06183480d2.
- We have reduced Ben's test case to something we can add as a test.
Version
0.2.126
Relevant log output
No response