[Bug]: dstack may show interrupted spot instances as provisioning for a long time

Open r4victor opened this issue 1 year ago • 0 comments

Steps to reproduce

When testing spot TPU provisioning, runs sometimes got stuck in provisioning even though GCP had interrupted the TPU immediately after it was provisioned.

The reason is that the server sets INTERRUPTED_BY_NO_CAPACITY on a job only after it has connected to the shim, which has ~10 min to become available. Until then, the server cannot distinguish between a shim that is not available yet and an instance that has already been interrupted.

The only fix I can think of is to introduce a backend method that returns the instance status.
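To illustrate the idea, here is a rough sketch of how the server could combine a backend-reported instance status with the shim wait. It is not dstack code: `InstanceStatus`, `JobDecision`, `decide`, and `SHIM_WAIT_LIMIT` are hypothetical names made up for this example, and the 10-minute limit only mirrors the approximate shim timeout mentioned above.

```python
from datetime import datetime, timedelta
from enum import Enum


class InstanceStatus(Enum):
    # Hypothetical statuses a backend method could report for an instance.
    PROVISIONING = "provisioning"
    RUNNING = "running"
    INTERRUPTED = "interrupted"


class JobDecision(Enum):
    # Hypothetical outcomes of one server-side check of a provisioning job.
    PROCEED = "proceed"
    KEEP_WAITING = "keep_waiting"
    INTERRUPTED_BY_NO_CAPACITY = "interrupted_by_no_capacity"
    SHIM_TIMEOUT = "shim_timeout"


# Assumed limit, mirroring the ~10 min the shim has to become available.
SHIM_WAIT_LIMIT = timedelta(minutes=10)


def decide(
    shim_reachable: bool,
    instance_status: InstanceStatus,
    provisioned_at: datetime,
    now: datetime,
) -> JobDecision:
    """Decide what to do with a job whose instance was just provisioned."""
    # The backend status is checked first, so an interruption is detected
    # even while the shim has not become reachable yet.
    if instance_status is InstanceStatus.INTERRUPTED:
        return JobDecision.INTERRUPTED_BY_NO_CAPACITY
    if shim_reachable:
        return JobDecision.PROCEED
    # The instance looks alive but the shim is silent: wait up to the limit.
    if now - provisioned_at > SHIM_WAIT_LIMIT:
        return JobDecision.SHIM_TIMEOUT
    return JobDecision.KEEP_WAITING
```

With a check like this, an interrupted spot instance would be reported on the next server poll instead of only after the shim wait expires. The cost is one extra backend API call per provisioning instance per poll.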

Actual behaviour

No response

Expected behaviour

No response

dstack version

master

Server logs

No response

Additional information

No response

r4victor · Jun 26 '24 10:06