TPUs may be interrupted immediately after provisioning leading to suboptimal retry

Open r4victor opened this issue 1 year ago • 0 comments

While testing TPU provisioning, I noticed that both on-demand and spot TPUs can be deleted right after a successful call to create the TPU. The server correctly fails the job with FAILED_TO_START_DUE_TO_NO_CAPACITY, so the job can be retried when retry is configured. But the retry is likely to try the same offers again, which makes it suboptimal.

Perhaps we can introduce a local cache of failed offers or randomize the order of offers (e.g. across regions/zones) to fix this.
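
A rough sketch of how the two ideas could be combined, purely for illustration. `Offer`, `FailedOfferCache`, and `order_offers` are hypothetical names, not existing dstack code, and the 10-minute TTL is an arbitrary assumption. Offers that failed recently are pushed to the end of the list, and offers with equal price are shuffled so that a retry does not always hit the same zone first.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Offer:
    # Hypothetical, simplified offer; not dstack's actual offer model.
    backend: str
    region: str
    instance_type: str
    price: float


class FailedOfferCache:
    """In-memory record of offers that recently failed to provision."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds  # assumed value; would need tuning
        self._clock = clock
        self._failed_at: Dict[Tuple[str, str, str], float] = {}

    @staticmethod
    def _key(offer: Offer) -> Tuple[str, str, str]:
        return (offer.backend, offer.region, offer.instance_type)

    def mark_failed(self, offer: Offer) -> None:
        self._failed_at[self._key(offer)] = self._clock()

    def recently_failed(self, offer: Offer) -> bool:
        key = self._key(offer)
        failed_at = self._failed_at.get(key)
        if failed_at is None:
            return False
        if self._clock() - failed_at >= self._ttl:
            # The entry has expired, so the offer is eligible again.
            del self._failed_at[key]
            return False
        return True


def order_offers(offers: List[Offer], cache: FailedOfferCache,
                 rng: Optional[random.Random] = None) -> List[Offer]:
    """Put recently failed offers last; shuffle offers of equal price."""
    rng = rng or random.Random()
    shuffled = list(offers)
    rng.shuffle(shuffled)
    # sorted() is stable, so ties on (failed, price) keep the shuffled order.
    return sorted(shuffled, key=lambda o: (cache.recently_failed(o), o.price))
```

With something like this, the server would call `mark_failed` when a job ends with FAILED_TO_START_DUE_TO_NO_CAPACITY and pass the candidate offers through `order_offers` before the next attempt. Failed offers are only deprioritized, not dropped, so they are still tried if nothing else is available.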

r4victor, Jun 26 '24 11:06