terraform-provider-iterative
Standardize on container images instead of machine images
Follow-up to https://github.com/iterative/terraform-provider-iterative/issues/127#issuecomment-863408982
It would be nice to offer a single, consistent environment on every platform, and we can ship default container images as part of the machine images to avoid pull delays and costs.
This proposal assumes that:
- The user-provided code is intended to (or at least can) run on Linux.
- Users who have on-premises GPU farms are able to install Docker.
I'm inclined to think that those assumptions are pretty reasonable, and a good compromise between impact and effort on our side.
If future versions of CRIU support loading/restoring the internal state of CUDA devices, standardizing on containers could have the additional advantage of allowing us to perform live migrations between spot instances. The advantages versus data-based checkpoints aren't especially obvious, but it looks like the next cool technology. 😄 See also https://github.com/iterative/terraform-provider-iterative/issues/176#issuecomment-895329370
Blockers for containerized cml runner
Of all the continuous integration systems we support,[^1] GitHub Actions is the only one that doesn't play nicely with containerized self-hosted runners:
- https://github.com/actions/runner/issues/406
- https://github.com/actions/runner/issues/367
- https://github.com/iterative/cml/issues/908#issuecomment-1063495461
[^1]: Namely, GitHub Actions, GitLab CI/CD and Bitbucket Pipelines.
Machine images offered by providers have lots of quirks and don't include any of the helper tools we need to offer a good user experience.
Custom images are the only alternative to provisioning instances on the fly, but forcing users to run tasks in a fixed environment could be unwise, especially when it implies committing to building and maintaining a stable and secure reference image.
Responsiveness-wise, the most appropriate solution would be to use containers or lightweight virtual machines with user-specified images, and to bundle some default general-purpose images with our custom machine images in order to reduce load times.
- https://github.com/nestybox/sysbox/issues/50
- https://github.com/kata-containers/documentation/blob/master/use-cases/Nvidia-GPU-passthrough-and-Kata.md
- https://github.com/firecracker-microvm/firecracker/issues/1179
- https://blog.cloudkernels.net/posts/vaccel_v2/
- https://github.com/google/gvisor/issues/14
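As a rough illustration of bundling default container images with a custom machine image, the template below pre-pulls a list of images while the machine image is built, so that tasks don't pay the pull delay at run time. It is only a sketch and assumes Packer with the Amazon plugin as the build tool; the `default_images` variable, the `runner` source name, the region, the base AMI filter and the instance type are all placeholders rather than anything this project is known to use.

```hcl
packer {
  required_plugins {
    amazon = {
      source  = "github.com/hashicorp/amazon"
      version = ">= 1.0.0"
    }
  }
}

# Hypothetical list of default general-purpose images to bake in.
variable "default_images" {
  type    = list(string)
  default = ["alpine"]
}

locals {
  # Compact timestamp so that every build gets a unique AMI name.
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "runner" {
  ami_name      = "runner-default-containers-${local.timestamp}"
  instance_type = "t3.medium" # placeholder
  region        = "us-west-1" # placeholder
  ssh_username  = "ubuntu"

  # Placeholder base image: latest Ubuntu 22.04 published by Canonical.
  source_ami_filter {
    filters = {
      name                = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    most_recent = true
    owners      = ["099720109477"]
  }
}

build {
  sources = ["source.amazon-ebs.runner"]

  # Install Docker, then pull every default image into the local cache
  # that ends up inside the resulting machine image.
  provisioner "shell" {
    inline = concat(
      ["curl -fsSL https://get.docker.com | sudo sh"],
      [for image in var.default_images : "sudo docker pull ${image}"]
    )
  }
}
```

User-specified images that are not in the list would still be pulled on demand; only the defaults would benefit from the warm cache.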
Moved from the experimental XPD library.
Do you mean allow `resource "iterative_task" { image = "docker://..." }`?
This issue predates the `iterative_task` resource, but yes.
> allow `resource "iterative_task" { image = "docker://..." }`
🪓
```hcl
terraform {
  required_providers {
    iterative = { source = "iterative/iterative" }
  }
}

provider "iterative" {}

resource "iterative_task" "example" {
  cloud   = "aws"
  image   = "nvidia"
  machine = "g4dn.xlarge"
  script  = <<-END
    #!/usr/bin/env -S sh -c 'docker run --rm -iv "$(realpath "$0"):/file" alpine sh /file'
    cat /etc/alpine-release
  END
}
```
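The trick in the example above is the shebang: `env -S` splits its argument into `sh -c '…'`, the kernel appends the script path (which becomes `$0`), and the script is then mounted into the container and executed there with `sh /file`. A minimal sketch of how the same workaround could be parametrized follows; the `container_image` variable and the `containerized` resource name are hypothetical, not part of the provider, and the fragment assumes the `terraform` and `provider` blocks shown above plus a machine image with Docker installed.

```hcl
# Hypothetical variable; the provider itself does not define it.
variable "container_image" {
  type    = string
  default = "alpine"
}

resource "iterative_task" "containerized" {
  cloud   = "aws"
  image   = "nvidia"
  machine = "g4dn.xlarge"
  script  = <<-END
    #!/usr/bin/env -S sh -c 'docker run --rm -iv "$(realpath "$0"):/file" ${var.container_image} sh /file'
    # Everything below the shebang runs inside the container.
    cat /etc/os-release
  END
}
```

Terraform only interpolates `${…}` sequences, so `$(realpath "$0")` reaches the shell untouched. The kernel limits the length of a shebang line, so very long image references may need a different approach.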