dstack
[Bug]: dstack may show interrupted spot instances as provisioning for a long time
Steps to reproduce
When testing spot TPU provisioning, runs sometimes got stuck in provisioning even though GCP had interrupted the TPU immediately after it was provisioned.
The reason is that the server sets INTERRUPTED_BY_NO_CAPACITY on a job only after it has connected to the shim, and the shim has ~10 min to become available. Until then, the server cannot tell a shim that is not available yet from an instance that has already been interrupted.
The only fix I can think of is to introduce a backend method that returns the instance status.
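
A rough sketch of how such a status check could feed into the decision while the shim is still unreachable. All names here (`InstanceStatus`, `next_job_state`, the returned state strings) are hypothetical and only illustrate the idea; they are not the existing dstack API, and the actual backend call that would produce the status is left out.

```python
from enum import Enum


class InstanceStatus(Enum):
    # Hypothetical statuses a backend method could report (not dstack's API).
    PROVISIONING = "provisioning"
    RUNNING = "running"
    INTERRUPTED = "interrupted"
    UNKNOWN = "unknown"


# The ~10 min window the shim has to become available.
SHIM_TIMEOUT_SECONDS = 10 * 60


def next_job_state(
    shim_reachable: bool,
    waited_seconds: float,
    instance_status: InstanceStatus,
) -> str:
    """Decide what to do with a provisioning job (hypothetical state names)."""
    if shim_reachable:
        # Shim is up: proceed with the normal flow.
        return "continue"
    if instance_status is InstanceStatus.INTERRUPTED:
        # The backend says the instance is gone: stop waiting right away
        # instead of sitting in provisioning until the shim timeout.
        return "interrupted_by_no_capacity"
    if waited_seconds >= SHIM_TIMEOUT_SECONDS:
        # Shim never came up and the backend did not report an interruption.
        return "shim_timeout"
    # Instance still looks alive (or status is unknown): keep waiting.
    return "keep_waiting"
```

With something like this, a spot TPU interrupted 30 seconds after provisioning would be marked as interrupted on the next check rather than after the full shim timeout.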
Actual behaviour
No response
Expected behaviour
No response
dstack version
master
Server logs
No response
Additional information
No response