[UX] Detect spot instance interruptions earlier

Open jvstme opened this issue 6 months ago • 2 comments

Problem

If a spot instance is interrupted, dstack only detects the interruption after the instance has been unreachable for some time:

If the job is running, it will be terminated after 2 minutes.
If the job is provisioning, it will be terminated after up to 20 minutes (often happens on RunPod).
If the instance is idle or busy, it will be marked as unreachable and deleted from dstack after 20 minutes.

This results in a few problems:

Poor UX: the user has no visibility into why an instance is unreachable, or has to wait up to 20 minutes before their run fails during provisioning.
Less frequent retries, possibly leading to increased service downtime.

Solution

Detect interruptions immediately using backend APIs, without relying on timeouts.

Implementation note

Add a method, such as get_instance or is_instance_alive, to backends.base.compute.Compute. Method implementations should check that the instance with the specified ID exists and has an appropriate status.
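A minimal sketch of what such a method could look like, not the actual dstack code. The Compute class here is a stand-in for backends.base.compute.Compute, and FakeCloudCompute, ALIVE_STATUSES, and the status strings are hypothetical names used only for illustration:

```python
from abc import ABC, abstractmethod
from typing import Dict

# Hypothetical status values; real backends report their own status strings.
ALIVE_STATUSES = {"pending", "running"}


class Compute(ABC):
    """Stand-in for backends.base.compute.Compute (illustrative only)."""

    @abstractmethod
    def is_instance_alive(self, instance_id: str) -> bool:
        """Return True if the instance exists and has an appropriate status."""


class FakeCloudCompute(Compute):
    """Toy backend whose "cloud API" is an in-memory dict of statuses."""

    def __init__(self, statuses: Dict[str, str]) -> None:
        self._statuses = statuses

    def is_instance_alive(self, instance_id: str) -> bool:
        # A missing instance yields None, which is not an alive status.
        status = self._statuses.get(instance_id)
        return status in ALIVE_STATUSES
```

A real backend would replace the dict lookup with a call to its cloud API. Treating a missing instance the same as a terminated one is an assumption here, based on the "exists and has an appropriate status" wording.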

Use the method to detect interruptions in the following cases:

Whenever the shim health check fails on an idle or busy instance.
Periodically during instance provisioning.

If an interruption is detected, terminate the instance immediately.
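The detection flow could be sketched roughly as follows. This is an illustration under assumptions, not dstack's implementation: check_for_interruption is a hypothetical helper, and the is_instance_alive and terminate callables stand in for whatever the backend and the server actually provide.

```python
from typing import Callable


def check_for_interruption(
    instance_id: str,
    is_instance_alive: Callable[[str], bool],
    terminate: Callable[[str], None],
) -> bool:
    """Terminate the instance right away if the backend reports it as gone.

    Intended to be called from both places named above: when the shim
    health check fails, and periodically during provisioning.
    Returns True if an interruption was detected.
    """
    if is_instance_alive(instance_id):
        # The instance still exists; the caller keeps its usual handling.
        return False
    terminate(instance_id)
    return True
```

For example, with a stub backend that reports every instance as gone:

```python
terminated = []
check_for_interruption("i-1", lambda _id: False, terminated.append)
# terminated now contains "i-1"
```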

If necessary, make the method implementation optional (move to a mixin in backends.base.compute) and fall back to timeouts for backends where the method is not implemented.
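One possible shape for the optional variant, sketched under assumptions. ComputeWithInstanceStatusSupport and should_terminate are hypothetical names, Compute is a stand-in for the real base class, and keeping the timeout as a backstop for backends that do implement the method is a design choice of this sketch, not something the issue specifies.

```python
from datetime import timedelta


class Compute:
    """Stand-in for backends.base.compute.Compute (illustrative only)."""


class ComputeWithInstanceStatusSupport:
    """Hypothetical mixin for backends that can report instance status."""

    def is_instance_alive(self, instance_id: str) -> bool:
        raise NotImplementedError


def should_terminate(
    compute: Compute,
    instance_id: str,
    unreachable_for: timedelta,
    timeout: timedelta,
) -> bool:
    """Decide whether an unreachable instance should be terminated now."""
    if isinstance(compute, ComputeWithInstanceStatusSupport):
        # The backend can tell us directly whether the instance is gone.
        if not compute.is_instance_alive(instance_id):
            return True
    # Fallback for backends without the mixin, and a backstop for
    # instances that still exist but stay unreachable.
    return unreachable_for >= timeout
```

With this shape, a backend opts in by inheriting the mixin alongside Compute, and backends that do not opt in keep the current timeout behavior unchanged.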

Jun 03 '25 11:06 jvstme