terraform-provider-iterative

Standardize on container images instead of machine images

Open 0x2b3bfa0 opened this issue 3 years ago • 6 comments

Follow-up of https://github.com/iterative/terraform-provider-iterative/issues/127#issuecomment-863408982

It would be nice to offer a single, consistent environment on every platform, and we can ship default container images as part of the machine images to avoid pull delays and costs.

This proposal assumes that:

  • The user-provided code is intended to (or at least can) run on Linux.
  • Users who have on-premises GPU farms are able to install Docker.

I'm inclined to think that those assumptions are pretty reasonable, and a good compromise between impact and effort on our side.
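Both assumptions can be verified on a candidate machine before anything is scheduled on it. The following is only a minimal sketch and not part of the provider: the function name `check_host` and its return codes are hypothetical, and it checks nothing beyond the kernel name and the presence of a `docker` binary on the `PATH`.

```shell
#!/bin/sh
# Hypothetical preflight check for the two assumptions above.
# Returns 0 when the host runs Linux and a docker CLI is on PATH,
# 1 when the kernel is not Linux, 2 when docker is missing.
check_host() {
  if [ "$(uname -s)" != "Linux" ]; then
    echo "unsupported kernel: $(uname -s)" >&2
    return 1
  fi
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found on PATH" >&2
    return 2
  fi
  return 0
}
```

Calling `check_host || exit 1` at the top of a setup script would stop early on hosts that don't meet the assumptions.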

0x2b3bfa0 avatar Jun 17 '21 18:06 0x2b3bfa0

If future versions of CRIU support loading/restoring the internal state of CUDA devices, standardizing on containers could have the additional advantage of allowing us to perform live migrations between spot instances. The advantages versus data-based checkpoints aren't especially obvious, but it looks like the next cool technology. 😄 See also https://github.com/iterative/terraform-provider-iterative/issues/176#issuecomment-895329370

0x2b3bfa0 avatar Aug 16 '21 19:08 0x2b3bfa0

Blockers for containerized cml runner

Of all the continuous integration systems we support,[^1] GitHub Actions is the only one that doesn't play nicely with containerized self-hosted runners:

  • https://github.com/actions/runner/issues/406
  • https://github.com/actions/runner/issues/367
  • https://github.com/iterative/cml/issues/908#issuecomment-1063495461

[^1]: Namely, GitHub Actions, GitLab CI/CD and Bitbucket Pipelines.

0x2b3bfa0 avatar Sep 30 '21 13:09 0x2b3bfa0

Machine images offered by providers have lots of quirks and don't include any of the helper tools we need to offer a good user experience.

Custom images are the only alternative to provisioning instances on the fly, but forcing users to run tasks in a fixed environment could be unwise, especially when it implies committing to building and maintaining a stable and secure reference image.

Responsiveness-wise, the most appropriate solution would be to use containers or lightweight virtual machines with user-specified images, and to bundle some default general-purpose container images into our custom machine images in order to reduce load times.

  • https://github.com/nestybox/sysbox/issues/50
  • https://github.com/kata-containers/documentation/blob/master/use-cases/Nvidia-GPU-passthrough-and-Kata.md
  • https://github.com/firecracker-microvm/firecracker/issues/1179
  • https://blog.cloudkernels.net/posts/vaccel_v2/
  • https://github.com/google/gvisor/issues/14
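Bundling default container images into a machine image amounts to pulling them once while the machine image is built, so that later `docker run` calls find them in the local cache. The following is a minimal sketch under that assumption, with Docker already installed on the build machine; the function name `prepull_images`, the `DOCKER` override and the image names are illustrative and not something the provider ships.

```shell
#!/bin/sh
# Hypothetical provisioning step for a machine image build.
# DOCKER can be overridden (e.g. DOCKER=echo) for a dry run.
DOCKER="${DOCKER:-docker}"

prepull_images() {
  for image in "$@"; do
    # Pull each image so it ends up in the local cache baked into the machine image.
    $DOCKER pull "$image" || return 1
  done
}

# Example (illustrative image names):
# prepull_images alpine:3.15 ubuntu:20.04
```

Stopping at the first failed pull keeps a partially populated cache from being baked into an image silently.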

Moved from the experimental XPD library.

0x2b3bfa0 avatar Nov 24 '21 16:11 0x2b3bfa0

do you mean allow resource "iterative_task" { image = "docker://..." }?

casperdcl avatar Apr 21 '22 04:04 casperdcl

This issue predates the iterative_task resource, but yes.

0x2b3bfa0 avatar Apr 21 '22 04:04 0x2b3bfa0

> allow resource "iterative_task" { image = "docker://..." }

🪓

terraform {
  required_providers {
    iterative = { source = "iterative/iterative" }
  }
}

provider "iterative" {}

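resource "iterative_task" "example" {
  cloud   = "aws"
  image   = "nvidia"
  machine = "g4dn.xlarge"

  # The shebang makes the host re-run this script inside an Alpine container:
  # env -S splits the command line, and sh -c receives the script path as $0.
  script = <<-END
    #!/usr/bin/env -S sh -c 'docker run --rm -iv "$(realpath "$0"):/file" alpine sh /file'
    cat /etc/alpine-release
  END
}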
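The shebang trick above boils down to mounting the script file into a container and executing it there with the image's own `sh`. The same idea can be sketched as an explicit shell function; the name `run_in_container` and the `DOCKER` override are hypothetical, and the sketch assumes `realpath` (GNU coreutils) and Docker are available on the host.

```shell
#!/bin/sh
# Hypothetical helper mirroring the shebang above: run a script file inside a container.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the command instead of running it.
DOCKER="${DOCKER:-docker}"

run_in_container() {
  image="$1"
  script="$2"
  # Mount the host script at /file and execute it with the image's sh.
  $DOCKER run --rm -iv "$(realpath "$script"):/file" "$image" sh /file
}

# Example: run_in_container alpine ./task.sh
```

Compared with the shebang, the function form makes the image a parameter, at the cost of needing a wrapper around the user's script.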

0x2b3bfa0 avatar Apr 21 '22 06:04 0x2b3bfa0