slurm_scheduler: handle OCI images
Description
Add support for running TorchX components via the Slurm OCI interface.
Motivation/Background
Slurm 21.08+ has support for running jobs inside OCI containers. This matches well with the docker/k8s images that we use by default. With workspaces + OCI we can support Slurm the same way as the docker-based environments.
Detailed Proposal
The new Slurm container support doesn't handle image resolution the same way docker/podman do. This means the images need to be placed on disk (much like a virtualenv would be), at a user-configurable path.
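Per the containers doc linked below, Slurm's `--container` flag takes the path to a prepared OCI bundle on disk, so the scheduler mostly needs to map an image name to a bundle path. A rough sketch of that mapping, using a hypothetical `image_dir` run option (the real option name and plumbing are open questions):

```python
# Rough sketch only -- ``image_dir`` is a hypothetical run option used here
# for illustration; names and paths are placeholders.
import os
from typing import List


def srun_args(image: str, image_dir: str, entrypoint: List[str]) -> List[str]:
    # e.g. image "example.com/my/image:latest" maps to a prepared bundle
    # (rootfs/ + config.json) under "<image_dir>/example.com/my/image:latest"
    bundle = os.path.join(image_dir, image)
    return ["srun", f"--container={bundle}", *entrypoint]


print(srun_args("example.com/my/image:latest", "/shared/oci-images", ["python", "-m", "foo.main"]))
```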
This also means we have to interact with docker/buildah to download the images and export them to an OCI image on disk. There are also open questions around image management, e.g. avoiding disk space issues.
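One possible flow for getting the image onto disk, assuming skopeo and umoci are available on the submit host (docker/buildah could fill the same role); the tool choice, paths, and caching policy here are all placeholders:

```python
# Illustrative only -- tool choice, paths, and cache/eviction policy are open.
import os
import subprocess


def pull_oci_bundle(image: str, image_dir: str) -> str:
    name = image.replace("/", "_").replace(":", "_")
    layout = os.path.join(image_dir, "_layouts", name)
    bundle = os.path.join(image_dir, "_bundles", name)
    if os.path.exists(bundle):
        # naive cache hit -- a real implementation needs eviction/GC to
        # avoid filling up the shared filesystem
        return bundle
    os.makedirs(os.path.dirname(layout), exist_ok=True)
    os.makedirs(os.path.dirname(bundle), exist_ok=True)
    # copy the image from the registry into an OCI image layout on disk
    subprocess.run(
        ["skopeo", "copy", f"docker://{image}", f"oci:{layout}:latest"],
        check=True,
    )
    # unpack the layout into a runtime bundle (rootfs/ + config.json) that
    # slurm's --container flag can point at (may need root or --rootless)
    subprocess.run(
        ["umoci", "unpack", "--image", f"{layout}:latest", bundle],
        check=True,
    )
    return bundle
```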
The cluster would have to be configured with nvidia-container-runtime for use with GPUs.
Alternatives
Additional context/links
https://slurm.schedmd.com/containers.html