terraform-provider-iterative Standardize on container images instead of machine images


Open 0x2b3bfa0 opened this issue 3 years ago • 6 comments

Follow-up of https://github.com/iterative/terraform-provider-iterative/issues/127#issuecomment-863408982

It would be nice to offer a single, consistent environment on every platform, and we can ship default container images as part of the machine images to avoid pull delays and costs.

This proposal assumes that:

The user-provided code is intended to (or at least can) run on Linux.
Users who have on-premises GPU farms are able to install Docker.

I'm inclined to think that those assumptions are pretty reasonable, and a good compromise between impact and effort on our side.

Jun 17 '21 18:06 0x2b3bfa0

If future versions of CRIU support loading/restoring the internal state of CUDA devices, standardizing on containers could have the additional advantage of allowing us to perform live migrations between spot instances. The advantages versus data-based checkpoints aren't especially obvious, but it looks like the next cool technology. 😄 See also https://github.com/iterative/terraform-provider-iterative/issues/176#issuecomment-895329370

Aug 16 '21 19:08 0x2b3bfa0

Blockers for containerized `cml runner`

Of all the continuous integration systems we support,[^1] GitHub Actions is the only one that doesn't play nicely with containerized self-hosted runners:

https://github.com/actions/runner/issues/406
https://github.com/actions/runner/issues/367
https://github.com/iterative/cml/issues/908#issuecomment-1063495461

[^1]: Namely, GitHub Actions, GitLab CI/CD and Bitbucket Pipelines.

Sep 30 '21 13:09 0x2b3bfa0

Machine images offered by providers have lots of quirks and don't include any of the helper tools we need to offer a good user experience.

Custom images are the only alternative to provisioning instances on the fly, but forcing users to run tasks in a fixed environment could be unwise, especially when it implies committing to build and maintain a stable and secure reference image.

Responsiveness-wise, the most appropriate solution would be to use containers or lightweight virtual machines with user-specified images, and to bundle some default general-purpose images with our custom machine images in order to reduce load times.

https://github.com/nestybox/sysbox/issues/50

https://github.com/kata-containers/documentation/blob/master/use-cases/Nvidia-GPU-passthrough-and-Kata.md

https://github.com/firecracker-microvm/firecracker/issues/1179

https://blog.cloudkernels.net/posts/vaccel_v2/

https://github.com/google/gvisor/issues/14

Moved from the experimental XPD library.

Nov 24 '21 16:11 0x2b3bfa0

do you mean allow resource "iterative_task" { image = "docker://..." }?

Apr 21 '22 04:04 casperdcl

This issue predates the iterative_task resource, but yes.

Apr 21 '22 04:04 0x2b3bfa0

> allow resource "iterative_task" { image = "docker://..." }

🪓

```hcl
terraform {
  required_providers {
    iterative = { source = "iterative/iterative" }
  }
}

provider "iterative" {}

resource "iterative_task" "example" {
  cloud   = "aws"
  image   = "nvidia"
  machine = "g4dn.xlarge"

  # The shebang makes the script re-run itself inside an Alpine container:
  # the script file is mounted as /file and executed there with sh.
  script = <<-END
    #!/usr/bin/env -S sh -c 'docker run --rm -iv "$(realpath "$0"):/file" alpine sh /file'
    cat /etc/alpine-release
  END
}
```
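
For comparison, the syntax asked about earlier in the thread might look roughly like the sketch below. This is hypothetical: nothing in this thread confirms that the provider accepts a `docker://` scheme, the image reference is left elided as in the original question, and the script body is only an illustrative placeholder.

```hcl
resource "iterative_task" "example" {
  cloud   = "aws"
  machine = "g4dn.xlarge"

  # Hypothetical attribute value, taken from the question above;
  # not confirmed to be supported by the provider.
  image = "docker://..."

  # Assumption: the script would run inside the container,
  # so no `docker run` wrapper would be needed in the shebang.
  script = <<-END
    #!/bin/sh
    cat /etc/os-release
  END
}
```

The difference from the workaround is that the container image would be declared on the resource itself instead of being hidden in the script's shebang line.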

Apr 21 '22 06:04 0x2b3bfa0
