[Feature]: Support Runpod network volumes

Open dinosaursarecool opened this issue 1 year ago • 5 comments

Problem

In order to get the most out of Runpod deployments, it would be amazing to have support for things like network storage, selecting specific data centers, or specifying a template_id that includes a lot of the existing configuration.

The create_pod [link] function in the api_client of the runpod backend accepts these parameters, namely template_id, data_center_id, and network_volume_id. However, when they are defined in a configuration, e.g. as example.dstack.yml:

type: task

spot_policy: auto
template_id: runpod-torch-v21
data_center_id: EU-RO-1

backends: [runpod]

dstack run . -f example.dstack.yml fails with:

3 validation errors for RunConfigurationRequest
__root__ -> TaskConfigurationRequest -> data_center_id
  extra fields not permitted (type=value_error.extra)
__root__ -> TaskConfigurationRequest -> template_id
  extra fields not permitted (type=value_error.extra)
__root__ -> TaskConfigurationRequest -> __root__
  Either `commands` or `image` must be set (type=value_error)

There are 2 problems with this:

  1. It appears the configuration values such as template_id, data_center_id, network_volume_id are not picked up as valid variables.
  2. On a philosophical level, there's a question of whether image or commands should be required in the dstack task itself when a runpod template is used (i.e., there is a template_id reference), as that template will already define the image and command. My biased view is that the template should override what's in the dstack configuration, but either way is workable, so it has little practical importance and may come down to what best fits the principles of the dstack architecture.

Having support for (1) would be incredibly helpful, as it would enable network volume usage on runpod, making dstack usable for large(r)-scale deployments where downloading remote models for each instance is too expensive.

Solution

Add support for runpod variables to the dstack configuration. Pass those variables to the runpod backend and the create_pod function.

Workaround

None to my knowledge, but I recognize there's an open issue for general volume support, https://github.com/dstackai/dstack/issues/1158, which would alleviate some of these pains. However, having support for these configuration variables in general seems like a quick win to increase runpod adoption.

Would you like to help us implement this feature by sending a PR?

No

dinosaursarecool avatar Jun 02 '24 17:06 dinosaursarecool

@dinosaursarecool Thank you very much for the request. Here are a few questions that may help us move forward with this:

  1. data_center_id. AFAIK, dstack supports this via regions:
type: task

spot_policy: auto
regions: [EU-RO-1]
backends: [runpod]
  2. network_volume_id: this feature is planned as part of https://github.com/dstackai/dstack/issues/1158

First, we'll support AWS and GCP, and after that we're happy to support RunPod too!

  3. As to template_id, is there anything you need template_id for that dstack doesn't support? I wonder why you may need to use templates. You can specify everything via commands and your repo files. Please let me know!

peterschmidt85 avatar Jun 03 '24 09:06 peterschmidt85
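
A minimal sketch of how the original example.dstack.yml might be adjusted to pass validation, combining the regions suggestion above with the validator's requirement that either commands or image be set. The commands entry is a hypothetical placeholder, and template_id is simply dropped since the validator rejects it as an extra field:

```yaml
type: task

# Hypothetical placeholder; replace with the actual workload
commands:
  - python train.py

spot_policy: auto
# Runpod data center ID expressed as a dstack region (replaces data_center_id)
regions: [EU-RO-1]
backends: [runpod]
```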

@peterschmidt85 Thanks, got it. Yeah I think everything should be achievable through the current dstack configuration except for volumes. So if volume support is solved in #1158 then I can see how we could consider template support to be superfluous

dinosaursarecool avatar Jun 03 '24 19:06 dinosaursarecool

@dinosaursarecool Would you mind if we update the title/description of this issue to focus on just volumes with RunPod?

peterschmidt85 avatar Jun 04 '24 07:06 peterschmidt85

@peterschmidt85 absolutely, updated the title

dinosaursarecool avatar Jun 04 '24 21:06 dinosaursarecool

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 avatar Jul 05 '24 01:07 peterschmidt85

@dinosaursarecool, the support for runpod network volumes is in master. Give it a try! It will be coming in the next 0.18.7 release within two weeks.

r4victor avatar Jul 19 '24 10:07 r4victor
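
For reference, a hedged sketch of what using a Runpod network volume might look like once the feature is released. It assumes the volume configuration syntax described in the dstack documentation (a `type: volume` configuration plus a `volumes` list in the run configuration). The file names, volume name, size, and mount path are hypothetical, so check the docs for the release you are running:

```yaml
# volume.dstack.yml (hypothetical file name)
type: volume
name: my-runpod-volume   # hypothetical volume name
backend: runpod
region: EU-RO-1          # data center from the original request
size: 100GB              # hypothetical size
```

```yaml
# task.dstack.yml (hypothetical file name)
type: task

# Hypothetical command that just lists the mounted volume
commands:
  - ls /volume_data

backends: [runpod]
regions: [EU-RO-1]

# Attach the volume defined above at a hypothetical mount path
volumes:
  - name: my-runpod-volume
    path: /volume_data
```

The task is pinned to the same region as the volume on the assumption that a network volume can only be attached to pods in its own data center.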