slurm_scheduler: handle OCI images

Open d4l3k opened this issue 3 years ago • 0 comments

Description

Add support for running TorchX components via the Slurm OCI interface.

Motivation/Background

Slurm 21.08+ supports running jobs inside OCI containers. This matches well with the docker/k8s images that we use by default. With workspaces + OCI we can support Slurm in the same way as the docker-based environments.

Detailed Proposal

The new Slurm container support doesn't locate images the same way docker/podman does. Images need to be placed on disk, much as a virtualenv would be, at a user-configurable path.

This also means that we have to interact with docker/buildah to download the images and export them to an OCI image on disk. There are also open questions about image management, such as how to avoid disk space issues.
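As a rough sketch of the export step, the snippet below only builds the commands and does not run them. It assumes `skopeo` and `umoci` are used for the download and unpack (the linked Slurm containers page describes this route; docker/buildah would need different commands) and that jobs are launched with Slurm's `--container` flag. The helper names, the `image_root` parameter, and the `.oci` suffix for the intermediate layout are hypothetical and not part of TorchX.

```python
import os
import re
from typing import List


def bundle_dir(image_root: str, image: str) -> str:
    """Map an image reference to a directory under the user-configured root."""
    # Collapse characters that are awkward in paths (":", "/", "@") into "_".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", image)
    return os.path.join(image_root, safe)


def export_commands(image: str, image_root: str) -> List[List[str]]:
    """Return the commands that would unpack `image` into an OCI bundle on disk."""
    bundle = bundle_dir(image_root, image)
    layout = bundle + ".oci"  # intermediate OCI image layout (hypothetical naming)
    return [
        # Pull from the registry into an OCI image layout on disk.
        ["skopeo", "copy", f"docker://{image}", f"oci:{layout}:latest"],
        # Unpack the layout into a runtime bundle (rootfs + config.json).
        ["umoci", "unpack", "--rootless", "--image", f"{layout}:latest", bundle],
    ]


def srun_command(image: str, image_root: str, argv: List[str]) -> List[str]:
    """Build an srun invocation that points Slurm at the unpacked bundle."""
    return ["srun", f"--container={bundle_dir(image_root, image)}", *argv]
```

Deriving the bundle path from the image name keeps the on-disk layout predictable, which would also make it easier to find and prune stale images later.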

The cluster would have to be configured with nvidia-container-runtime for use with GPUs.

Alternatives

Additional context/links

https://slurm.schedmd.com/containers.html

d4l3k, Nov 15 '21 23:11