
RFC: Improve OCI Image Python Tooling

Open · d4l3k opened this issue · 1 comment

Description

Quite a few of the cloud services and cluster tools for running ML jobs use OCI/Docker containers, so I've been looking into how to make dealing with them easier.

Container based services:

  • Kubernetes / Volcano scheduler
  • AWS EKS / Batch
  • Google AI Platform training
  • Recent versions of slurm https://slurm.schedmd.com/containers.html

TorchX currently supports patches on top of existing images to make it fast to iterate and then launch a training job. These patches simply overlay files from the local directory on top of a base image. Our current patching implementation relies on having a local docker daemon to build a patch layer and push it: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L437-L493

Ideally we could build a patch layer and push it in pure Python without requiring a local docker instance, since that's an extra burden on ML researchers/users. Building a patch should be fairly straightforward since it's just appending a layer; pushing will require the ability to talk to the registry to download/upload images.
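To illustrate that the layer-building half really needs nothing beyond the standard library, here's a minimal sketch that tars up a set of files and computes the two digests the OCI image spec needs (the compressed blob digest for the registry, and the uncompressed `diff_id` for the image config). The function name and return shape are illustrative, not an existing API; registry push and manifest wiring are omitted.

```python
import gzip
import hashlib
import io
import tarfile

def build_patch_layer(files: dict[str, bytes]) -> tuple[bytes, str, str]:
    """Build a gzipped OCI layer blob from a mapping of paths to contents.

    Returns (blob, blob_digest, diff_id): blob_digest identifies the
    compressed blob in the registry, and diff_id is the digest of the
    uncompressed tar, as recorded in the image config's rootfs.diff_ids.
    """
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        # Sort entries so the same inputs always produce the same digest.
        for path, data in sorted(files.items()):
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    raw = tar_buf.getvalue()
    diff_id = "sha256:" + hashlib.sha256(raw).hexdigest()
    # mtime=0 keeps the gzip header (and thus the blob digest) reproducible.
    blob = gzip.compress(raw, mtime=0)
    blob_digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    return blob, blob_digest, diff_id

blob, blob_digest, diff_id = build_patch_layer({"app/main.py": b"print('hi')\n"})
```

Keeping the layer reproducible (sorted entries, fixed timestamps) matters for caching: an unchanged workspace should hash to the same blob and skip the upload.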

OCI containers seem like a logical choice for packaging ML training jobs/apps, but the current Python tooling is fairly lacking as far as I can see. Making them easier to work with will likely help with the cloud story.

Detailed Proposal

Create a library for Python to manipulate OCI images with the following subset of features:

  • download/upload images to OCI repos
  • append layers to OCI images
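
The append operation maps cleanly onto the OCI image spec: adding a layer means one new entry in the manifest's `layers` array and one new `diff_id` in the config's `rootfs`. A sketch of what such a library call might do, with all names hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Image:
    """In-memory OCI image: manifest and config documents (hypothetical type)."""
    manifest: dict
    config: dict

def append_layer(image: Image, blob: bytes, blob_digest: str, diff_id: str) -> Image:
    """Record a new layer in both the manifest and the image config."""
    image.manifest["layers"].append({
        # Media type for a gzipped layer, per the OCI image spec.
        "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
        "digest": blob_digest,
        "size": len(blob),
    })
    # diff_ids hold digests of the *uncompressed* layer tars.
    image.config["rootfs"]["diff_ids"].append(diff_id)
    return image
```

The manifest and config must stay in lockstep (one `layers` entry per `diff_id`), which is exactly the kind of invariant worth hiding behind a library API.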

Non-goals:

  • Execute containers
  • Dockerfiles
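
For the download/upload goal, talking to a registry is plain HTTP against the OCI distribution spec. As a sketch, fetching a manifest is a GET on `/v2/<repo>/manifests/<ref>` with the right `Accept` header; the helper below only builds the request (the token auth handshake via `WWW-Authenticate` is omitted, and the function name is illustrative):

```python
import urllib.request

OCI_MANIFEST = "application/vnd.oci.image.manifest.v1+json"
DOCKER_MANIFEST = "application/vnd.docker.distribution.manifest.v2+json"

def manifest_request(registry: str, repo: str, ref: str) -> urllib.request.Request:
    """Build a GET request for an image manifest per the OCI distribution spec.

    Accepting both OCI and Docker v2 manifest media types covers most
    registries; auth is omitted for brevity.
    """
    url = f"https://{registry}/v2/{repo}/manifests/{ref}"
    return urllib.request.Request(
        url, headers={"Accept": f"{OCI_MANIFEST}, {DOCKER_MANIFEST}"}
    )

req = manifest_request("ghcr.io", "pytorch/torchx", "0.4.0")
```

Blobs (layers and configs) follow the same pattern via `/v2/<repo>/blobs/<digest>`, and uploads use the spec's chunked `POST`/`PUT` flow.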

Alternatives

Additional context/links

There is an existing oci-python library, but it's fairly early-stage. We may be able to build upon it to enable this.

I opened an issue there as well: https://github.com/vsoch/oci-python/issues/15

d4l3k · Feb 11 '22

I think I just stumbled across this limitation. I was trying to get torchx running on a fresh k8s cluster using CRI-O instead of docker/containerd as the runtime, and it always fails when trying to pull the image (which I imagine is only the first of a few "problematic" steps).

~$ torchx run -s kubernetes dist.ddp --script compute_world_size/main.py -j 1x1
torchx 2023-01-23 14:47:40 INFO     loaded configs from /home/user/playground/torchx_examples/torchx/examples/apps/.torchxconfig
torchx 2023-01-23 14:47:40 INFO     Checking for changes in workspace `file:///home/user/playground/torchx_examples/torchx/examples/apps`...
torchx 2023-01-23 14:47:40 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2023-01-23 14:47:40 INFO     Workspace `file:///home/user/playground/torchx_examples/torchx/examples/apps` resolved to filesystem path `/home/user/playground/torchx_examples/torchx/examples/apps`
torchx 2023-01-23 14:47:40 WARNING  failed to pull image ghcr.io/pytorch/torchx:0.4.0, falling back to local: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
torchx 2023-01-23 14:47:40 INFO     Building workspace docker image (this may take a while)...

... [trace left out, can attach it if required]

Could you please confirm this is actually related to the issue you are describing? If it is, would installing the docker runtime alongside CRI-O be enough to get the toolchain back up and running? Also, are there any other steps required to get such a setup working?

Best regards

Migsi · Jan 23 '23