[Feature]: Ability to start and stop instances without terminating
Problem
Currently, when we start a run, an instance is created, and when we stop it, the instance is terminated/destroyed without saving any state to disk. I want the instance to stop/shut down while retaining its disk, with the ability to start the same instance again. This is similar to pressing the start/stop buttons in GCP, AWS, Azure, or Vast.ai, where we are only billed for the persistent disk/network volume.
Solution
Add a new parameter (let's say idle warm time). It determines how long the instance remains in the started state after the service receives its last request; once the idle warm time elapses, the instance is stopped without being terminated. If no more requests arrive after the idle warm time, the service can scale down and terminate. In providers like Vast.ai, resuming will be subject to GPU availability. In providers like GCP, we need to manage ephemeral (non-static) IP addresses.
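For illustration, the parameter might look like this in a service configuration. Note that the name `idle_warm_time` and its placement are hypothetical, not an existing dstack option:

```yaml
type: service
commands: ...
replicas: 2
# Hypothetical parameter: stop (but don't terminate) the instance
# this long after the last request; terminate only on scale-down.
idle_warm_time: 10m
```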
Benefit
Can help with cold starts in a production deployment. Can also help save state for later reuse while cutting GPU costs, since we are billed only for storage and static IP addresses.
Alternatives
Similar to how serverless infrastructure works.
Would you like to help contributing this feature?
Yes
@iRohith, thanks for the issue! I can see two requests here that I'd like to address separately:
- Solve the problem of cold starts when upscaling service replicas (currently in development #998).
- Allow pausing and resuming runs with state persistence (e.g. useful for dev environments, tasks).
If my understanding is correct, the main driver for this issue is the cold start problem (correct me if I'm wrong). I'd like to suggest an alternative solution to the cold start problem that may fit better into dstack and be easier to implement.
The idea is to let users specify directories for services (aka cache) that would be mounted in the container and retained across service replica starts. So a user would specify directories like this:

```yaml
type: service
commands: ...
replicas: 2
cache:
  - "/root/.cache/pip"
  - "/root/.cache/huggingface"
```
Then, on replica downscaling, the instance will be terminated but the network storage volume will be kept. On replica upscaling, the volume will be attached to a newly created instance.
The approach you're suggesting can also solve the cold start problem, and it additionally allows "pausing" of dev environments and tasks, which is great. However, it comes with some UX challenges caused by the guarantee to preserve the container state. For example, if there is no availability when trying to resume a stopped replica/job, we cannot simply start a new replica in a different region/cloud, because it would start with fresh state. With a cache, we can start a fresh instance – it only affects the start time.
So my suggestion would be to start by implementing the "cache" feature. The work would mostly be to support storage volume detachment/attachment in selected backends (e.g. gcp), and some supplementary logic to specify cache dirs and mount them in the container.
The "pause" feature can be thought out and supported later reusing the same backend functionality used by "cache".
@iRohith, please tell if this plan works for you. We can then discuss the implementation details.
I would like to cache the whole Docker image on the attached volume. So instead of network volumes, can we use boot disks on supported platforms to cache the whole instance state and then reuse it later? That would further reduce cold start time. Otherwise, the plan is a great start, and we can implement pause/resume later if needed. A network volume could also be an add-on.
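This idea could also be expressed with the cache syntax proposed earlier in the thread, assuming the cache list may include the Docker data-root. The directory paths below are standard Docker and Hugging Face defaults, but the `cache` key itself is still only a proposal:

```yaml
type: service
commands: ...
replicas: 2
cache:
  # Entire Docker data-root: pulled images and container layers
  - "/var/lib/docker"
  # Model weights downloaded by Hugging Face libraries
  - "/root/.cache/huggingface"
```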
I see! We previously discussed two UX approaches: "pause/resume" and "cache". So now we're talking about two approaches to implement them:
- "pause/resume" of instances on supported providers such as AWS, GCP, Azure.
- "detaching/attaching" of network volumes.
There are different pros/cons to both. "Detaching/attaching" is supported by more providers, so it's more general in this respect. Also, there is a better chance the instance will be provisioned with the persistent volume attached, since you're not tied to a particular instance type – only to a cloud/region. I believe we can also cache the entire Docker data-root, which includes the pulled images, to the attached volume, so that's not an issue.
But I can see reasons to go with the "pause/resume" approach you're suggesting. First, it should be quicker to restart an instance than to boot a new one, since there is no need to perform full initialization. (It's not possible to attach an existing EBS volume as a root device, so the init has to happen.) If the primary motivation is to help with cold starts, this is a big point. Also, the "pause/resume" approach may be much easier to implement and fit into the current dstack codebase – no need to handle network volumes.
Since we'll add support for different providers gradually, from the implementation perspective I'd indeed consider going with "pause/resume" of instances first, e.g. for AWS, GCP, Azure. But from the UX perspective we should start with "cache" – the resumed instance will be used to start a new container with cached directories mounted (not restarting a stopped one).
Next, we could start implementing "detaching/attaching" of network volumes for providers that don't have "pause/resume", and add this support to AWS, GCP, and Azure as well, to increase provisioning chances if the fastest method – "resuming" – fails.
"pause/resume" of runs can also be supported on top of new functionality in addition to the "cache" feature.
@iRohith, FYI, we've already added a prototype for resumable runs on GCP but dropped it (#590). It relied on stopping/starting of GCP instances and restarting a stopped docker container. The code is mostly obsolete, but GCP-related Compute functionality may be relevant.
The question I'm still pondering is whether to add cache support for dev environments and tasks initially. For a service replica, it's reasonable to keep instances/volumes while the run is running. For dev environments and tasks, the cache would only make sense across run starts – so caching would be tied to run configurations. We can introduce a cache idle time to avoid keeping the cache forever, but there would still be a UX difference compared to services.
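If such a knob were added, a dev environment configuration might look like this. Both the `cache` key and the `cache_idle_time` name are hypothetical here, sketched only to illustrate the UX question above:

```yaml
type: dev-environment
ide: vscode
cache:
  - "/root/.cache/pip"
# Hypothetical: delete the cached volume if the run
# hasn't been started again within this period.
cache_idle_time: 72h
```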
@iRohith Let me make a summary of what we propose:
- Not implement the `pause/resume` feature now.
- Instead, start by implementing the `cache` feature (leveraging network volumes). This will cache any downloaded data, such as Docker images, model weights, etc. This is going to reliably speed up the start-up time across various providers. First, it will be implemented for AWS, GCP, and Azure. Later we can implement it for other providers too, incl. Vast.ai.
Please confirm if you agree with the plan. If so, we can prioritize it and work on it together.
@peterschmidt85 @r4victor I agree with the plan to prioritize implementing the cache feature instead of the pause/resume feature. Let's proceed with this plan and work on it together. Also, I think we could add an option to use object storage like S3 with tools like rclone to store a common cache (e.g. model weights, but not Docker images) across cloud providers. Is this possible if we are going to use network volumes anyway? If performance is an issue, we could also cache files locally in the background on start using rclone's cache.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.