[UX] Detect spot instance interruptions earlier
Problem
If a spot instance is interrupted, dstack detects the interruption only after the instance has been unreachable for some time:
- If the job is running, it will be terminated after 2 minutes.
- If the job is provisioning, it will be terminated after up to 20 minutes (often happens on RunPod).
- If the instance is idle or busy, it will be marked as unreachable and deleted from dstack after 20 minutes.
This results in a few problems:
- Poor UX: the user has no visibility into why an instance is unreachable, or has to wait up to 20 minutes before provisioning of their run fails.
- Less frequent retries, possibly leading to increased service downtime.
Solution
Detect interruptions immediately using backend APIs, without relying on timeouts.
Implementation note
Add a method, such as get_instance or is_instance_alive, to backends.base.compute.Compute. Implementations should check that the instance with the specified ID exists and has an appropriate status.
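A minimal sketch of what such a method could look like. Everything here is illustrative: the Compute class below is a stand-in rather than the real backends.base.compute.Compute, and InstanceState, InMemoryCompute and the set of "alive" states are assumptions, not existing dstack code.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Dict


class InstanceState(str, Enum):
    # Hypothetical, backend-agnostic states; real backends expose their own.
    PENDING = "pending"
    RUNNING = "running"
    TERMINATED = "terminated"


class Compute(ABC):
    # Stand-in for backends.base.compute.Compute (assumption).
    @abstractmethod
    def is_instance_alive(self, instance_id: str) -> bool:
        """Return True if the instance exists and has an appropriate status."""


class InMemoryCompute(Compute):
    # Toy backend that looks instances up in a dict instead of a cloud API.
    def __init__(self, instances: Dict[str, InstanceState]) -> None:
        self._instances = instances

    def is_instance_alive(self, instance_id: str) -> bool:
        state = self._instances.get(instance_id)
        # A missing instance (None) or a terminated one counts as not alive.
        return state in (InstanceState.PENDING, InstanceState.RUNNING)
```

A real implementation would replace the dict lookup with a call to the backend's own API and map the backend-specific status onto alive or not alive.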
Use the method to detect interruptions in the following cases:
- Whenever the shim health check fails on an idle or busy instance.
- Periodically during instance provisioning.
If an interruption is detected, terminate the instance immediately.
If necessary, make the method implementation optional (move to a mixin in backends.base.compute) and fall back to timeouts for backends where the method is not implemented.
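The optional variant with a timeout fallback could be sketched roughly as follows. ComputeWithLivenessCheck, should_terminate and both toy backends are hypothetical names. The sketch also assumes that an instance the backend reports as alive but that stays unreachable is still subject to the existing 20-minute timeout.

```python
from datetime import datetime, timedelta
from typing import Set

# Existing timeout for unreachable idle or busy instances.
UNREACHABLE_TIMEOUT = timedelta(minutes=20)


class ComputeWithLivenessCheck:
    # Hypothetical mixin for backends that can query instance status.
    def is_instance_alive(self, instance_id: str) -> bool:
        raise NotImplementedError


class LegacyCompute:
    # Toy backend without the mixin; only the timeout applies.
    pass


class CheckingCompute(ComputeWithLivenessCheck):
    # Toy backend that treats ids in a set as alive instances.
    def __init__(self, alive_ids: Set[str]) -> None:
        self._alive_ids = alive_ids

    def is_instance_alive(self, instance_id: str) -> bool:
        return instance_id in self._alive_ids


def should_terminate(
    compute: object,
    instance_id: str,
    unreachable_since: datetime,
    now: datetime,
) -> bool:
    # Backends with the mixin report an interruption immediately.
    if isinstance(compute, ComputeWithLivenessCheck):
        if not compute.is_instance_alive(instance_id):
            return True
    # Otherwise (no mixin, or instance still alive) fall back to the timeout.
    return now - unreachable_since >= UNREACHABLE_TIMEOUT
```

With this shape, the caller that handles failed shim health checks does not need to know which backends implement the check: it calls should_terminate and gets either an immediate answer or the old timeout behaviour.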