[UX] Detect spot instance interruptions earlier

Open jvstme opened this issue 6 months ago • 2 comments

Problem

If a spot instance is interrupted, dstack detects the interruption only after the instance has been unreachable for some time:

  • If the job is running, it will be terminated after 2 minutes.
  • If the job is provisioning, it will be terminated after up to 20 minutes (this often happens on RunPod).
  • If the instance is idle or busy, it will be marked as unreachable and deleted from dstack after 20 minutes.

This results in a few problems:

  • Poor UX: the user cannot see why an instance is unreachable, or has to wait up to 20 minutes before run provisioning fails.
  • Less frequent retries, possibly leading to increased service downtime.

Solution

Detect interruptions immediately using backend APIs, without relying on timeouts.

Implementation note

Add a method, such as get_instance or is_instance_alive, to backends.base.compute.Compute. Implementations should check that the instance with the specified ID exists and has an appropriate status.
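A rough sketch of what such a method could look like follows. Only the names Compute and is_instance_alive come from this issue. InstanceStatus, ALIVE_STATUSES, get_instance_status, and FakeCompute are hypothetical, made up for illustration, and are not part of dstack.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Dict, Optional


class InstanceStatus(str, Enum):
    # Hypothetical, simplified statuses; real backends report their own values.
    PENDING = "pending"
    RUNNING = "running"
    TERMINATED = "terminated"


# Assumption: which statuses count as "appropriate" is decided per backend.
ALIVE_STATUSES = {InstanceStatus.PENDING, InstanceStatus.RUNNING}


class Compute(ABC):
    """Stand-in for backends.base.compute.Compute (illustrative only)."""

    @abstractmethod
    def get_instance_status(self, instance_id: str) -> Optional[InstanceStatus]:
        """Return the backend-reported status, or None if the instance is gone."""

    def is_instance_alive(self, instance_id: str) -> bool:
        # The instance must exist and have an appropriate status.
        status = self.get_instance_status(instance_id)
        return status is not None and status in ALIVE_STATUSES


class FakeCompute(Compute):
    """In-memory backend used only to exercise the sketch."""

    def __init__(self, instances: Dict[str, InstanceStatus]) -> None:
        self._instances = dict(instances)

    def get_instance_status(self, instance_id: str) -> Optional[InstanceStatus]:
        return self._instances.get(instance_id)
```

With this shape, a backend only has to map its own API response to a status, and the existence and status checks live in one place.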

Use the method to detect interruptions in the following cases:

  • Whenever the shim health check fails on an idle or busy instance.
  • Periodically during instance provisioning.

If an interruption is detected, terminate the instance immediately.
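The two trigger points above could be wired up roughly as follows. This is a sketch under assumptions, not dstack's actual processing code: Action, on_shim_health_check, and on_provisioning_tick are hypothetical names, and the backend check is passed in as a plain callable so the example stays self-contained.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    # Hypothetical outcomes of a single check.
    CONTINUE = "continue"
    TERMINATE = "terminate"


def on_shim_health_check(
    shim_healthy: bool, is_instance_alive: Callable[[], bool]
) -> Action:
    """Case 1: idle or busy instance, after each shim health check."""
    if shim_healthy:
        # The backend is queried only when the health check fails.
        return Action.CONTINUE
    return Action.CONTINUE if is_instance_alive() else Action.TERMINATE


def on_provisioning_tick(is_instance_alive: Callable[[], bool]) -> Action:
    """Case 2: called periodically while the instance is provisioning."""
    return Action.CONTINUE if is_instance_alive() else Action.TERMINATE
```

Querying the backend only after a failed health check keeps the number of backend API calls low for healthy instances, which may matter for backends with rate limits.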

If necessary, make the method implementation optional (move it to a mixin in backends.base.compute) and fall back to timeouts for backends where the method is not implemented.
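One possible shape for the optional variant is sketched below. The mixin name ComputeWithLivenessCheck, the should_terminate helper, and the UNREACHABLE_TIMEOUT constant are hypothetical. The 20-minute value mirrors the timeout described in the Problem section. Keeping the timeout as a safety net for backends that do implement the check is an assumption of this sketch, not something the issue specifies.

```python
from datetime import timedelta


class Compute:
    """Stand-in for backends.base.compute.Compute (illustrative only)."""


class ComputeWithLivenessCheck:
    """Hypothetical mixin for backends that can report instance liveness."""

    def is_instance_alive(self, instance_id: str) -> bool:
        raise NotImplementedError


# Mirrors the existing 20-minute unreachable timeout described in the issue.
UNREACHABLE_TIMEOUT = timedelta(minutes=20)


def should_terminate(
    compute: Compute, instance_id: str, unreachable_for: timedelta
) -> bool:
    # Fast path: the backend supports the check and says the instance is gone.
    if isinstance(compute, ComputeWithLivenessCheck):
        if not compute.is_instance_alive(instance_id):
            return True
    # Fallback (and, by assumption, a safety net for all backends): timeout.
    return unreachable_for >= UNREACHABLE_TIMEOUT
```

A backend opts in by inheriting from both classes, for example class MyCompute(Compute, ComputeWithLivenessCheck), while backends without the mixin keep the current timeout behavior unchanged.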

jvstme, Jun 03 '25 11:06