
WIP EPIC: Remote Executor à la SageMaker

Open DavidGOrtega opened this issue 2 years ago • 5 comments

SageMaker

It will:

  • Upload your training script and dependencies to Amazon S3
  • Provision the specified number of instances in a fully managed cluster
  • Pull the specified container image and instantiate containers on every instance
  • Download the training code from Amazon S3 into the instance and make it available in the container
  • Download training dataset from Amazon S3 and make it available in the container
  • Run training
  • Copy trained models to a specified location in Amazon S3

Additionally with spot instances:

  • automatically back up your training checkpoints (stored in /opt/ml/checkpoints by default) to Amazon S3
  • if the training instance is terminated due to lack of capacity, keep polling for capacity and automatically restart training once capacity becomes available, copying your dataset and the checkpoint files into the new instance and making them available to your training script in a Docker container so that training can resume from the latest checkpoint

Cons:

  • Docker dep
  • SageMaker Python SDK dependency and entry-point estimator functions

TPI Remote Executor

Use case

Users want to train their models with:

  • a) remote machines that give access to better hardware #210
  • b) spot instances to reduce costs #211
  • c) resilience to failures such as machine disposal or other hardware issues #211
  • d) easy access to logs and the machine
  • e) support for many languages (R, Python... like the DVC CLI)
  • f) the current workspace replicated on the remote #212
  • g) data synced with the remote #211

Plausible Solutions

TPI Machine

Users could simply spin up a TPI machine and set its startup script. The problem is that this approach satisfies neither b, c, nor f: the user would have to sync the workspace and data on their own and migrate to another spot instance manually.

Executor

A very simple multi-platform program that executes a piece of code and is also in charge of:

  • Preparing the workspace via repo cloning
  • Watching files to automatically sync them with a remote
  • Watching for instance termination and initiating the migration strategy
  • Additionally, it exposes:
    • logging
    • interactive shell
    • metrics

Additionally:

  • The executor is started locally but does not need the local machine to stay alive

We already have a pretty close executor in place: our cml-runner, which actually runs CI runners. If it executed a script instead of running the runners, we would have an executor skeleton.
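A minimal sketch of that executor skeleton, assuming the core job is just "write the user's script to disk, run it through a shell, and stream its output". The `runScript` name is hypothetical, and a real executor would forward the streams to a log endpoint rather than the local terminal:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runScript is a minimal executor core: it writes the user script to a
// temporary file, runs it through a shell, and streams stdout/stderr
// (here to the local terminal; a real executor would forward them).
func runScript(script string) error {
	f, err := os.CreateTemp("", "tpi-script-*.sh")
	if err != nil {
		return err
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString(script); err != nil {
		return err
	}
	f.Close()

	cmd := exec.Command("sh", f.Name())
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run() // non-zero exit codes surface as an error
}

func main() {
	if err := runScript("echo 'hello from the executor'"); err != nil {
		fmt.Fprintln(os.Stderr, "script failed:", err)
		os.Exit(1)
	}
}
```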

Watching files to automatically sync them with a remote

The executor watches a configurable folder and syncs it with a configurable remote storage on file change:

  • internally dvc?
  • must handle interruptions

Watching for instance termination and initiating the migration strategy

  • Fire and forget. Using the cloud vendors' capability of requesting spot instances via API with a maximum wait time, the executor could place the request before dying. Problems with this strategy:
    • We have no way to contact the executor during this migration.
    • If the request fails for any reason, the executor cannot resume.
  • Seed on a cheaper instance. The executor moves to a small instance so that it can poll for capacity, and then continues from the spot instance.

Logging

The agent could expose a small UI with logs and a shell behind Basic auth, using introspectable local tunnels like the tmate GitHub Action or ngrok. The subdomain would be generated by TPI and reused throughout the executor's lifetime, with access granted via a password, like Jupyter.

E.g. 1234.cml.dev

#209 would be tricky: we would probably have to use a lock mechanism held in the remote bucket to determine whether the executor is already up.

Agent

The user launches an agent that waits for jobs to be executed. The main difference between an executor and the agent is that the latter needs the support of an external job queue, like a CI runner. Indeed, it would be close to cml-runner; however, instead of using the SCM, it would use the Viewer UI to run experiments.

This would be effective to address #209.

Cons:

  • It would depend on Viewer, or another solution if this happens

This could also be an extended feature of an Executor, and a good reason for Viewer adoption.

DavidGOrtega avatar Sep 13 '21 14:09 DavidGOrtega

(image attachment)

DavidGOrtega avatar Oct 04 '21 10:10 DavidGOrtega

Acceptance tests proposal WIP
name: Remote executor acceptance tests
on: push

env:
  REPO_TOKEN: ${{ secrets.REPO_TOKEN }}
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

jobs:
  case-1:
    runs-on: ubuntu-18.04
    timeout-minutes: 10
    steps:
      - name: Execute remote script with ENV
        run: |
          # Quote the outer delimiter so $MYVAR is not expanded locally, and
          # use a distinct inner delimiter so the nested heredocs don't clash.
          cat << 'MAIN_TF' > main.tf
            terraform {
              required_providers {
                iterative = {
                  source = "iterative/iterative"
                }
              }
            }

            provider "iterative" {}

            resource "iterative_remote_exec" "exec" {
              cloud = "aws"
              region = "us-west"
              instance_type = "t2.micro"
              instance_hdd_size = 30
              spot = true

              command = <<-END
                echo $MYVAR
              END
            }
          MAIN_TF

          export MYVAR='Hello world'

          terraform init
          terraform apply --auto-approve

          # We have to loop here to check that the status is finished

          if ! grep -q 'Hello world' terraform.tfstate; then
            exit 125
          fi

  case-2:
    runs-on: ubuntu-18.04
    timeout-minutes: 10
    steps:
      - name: Execute remote script with work folder
        run: |
          cat << 'MAIN_TF' > main.tf
            terraform {
              required_providers {
                iterative = {
                  source = "iterative/iterative"
                }
              }
            }

            provider "iterative" {}

            resource "iterative_remote_exec" "exec" {
              cloud = "aws"
              region = "us-west"
              instance_type = "t2.micro"
              instance_hdd_size = 30
              spot = true

              command = <<-END
                cat myfile.txt
              END
            }
          MAIN_TF

          echo 'Hello world' > myfile.txt

          terraform init
          terraform apply --auto-approve

          # We have to loop here to check that the status is finished

          if ! grep -q 'Hello world' terraform.tfstate; then
            exit 125
          fi

  case-3:
    runs-on: ubuntu-18.04
    timeout-minutes: 10
    steps:
      - name: Sync data with remote assuming /opt/tpi/
        run: |
          cat << 'MAIN_TF' > main.tf
            terraform {
              required_providers {
                iterative = {
                  source = "iterative/iterative"
                }
              }
            }

            provider "iterative" {}

            resource "iterative_remote_exec" "exec" {
              cloud = "aws"
              region = "us-west"
              instance_type = "t2.micro"
              instance_hdd_size = 30
              spot = true

              command = <<-END
                echo 'Hello world' > /opt/tpi/myfile.txt
              END
            }
          MAIN_TF

          terraform init
          terraform apply --auto-approve

          # We have to loop here to check that the status is finished

          EXEC_ID=''
          sudo apt-get install -y awscli
          if ! aws s3 ls "s3://${EXEC_ID}/myfile.txt"; then
            exit 125
          fi

FAQ

  • How could we reuse the main.tf, declaring a new resource? Is it a good idea? (from @jbencook)

  • Is the remote state going to be stored in the same provider as the executor instance?

    • Could it be a different provider?
    • Could it be handled by DVC?
  • Do we have to sync a specific folder, or would it be better to just sync the current workdir when none is specified by a workdir variable?

  • Do we have to sync the whole environment, or maybe only the variables specified in an env variable in TPI?

  • Should we reuse cml-runner, or should we fork the GitLab runner?

    • If not now, should we do it later, or does it make sense to build our own multi-SCM agent?
  • The spot-survival approach relies on cloud capabilities to enqueue the spot instance order.

    • Do all clouds support this feature?
    • How do we handle the spot maximum waiting time?

DavidGOrtega avatar Oct 04 '21 10:10 DavidGOrtega

> Agent
>
> The user launches an agent that waits for jobs to be executed. The main difference between an executor and the agent is that the latter needs the support of an external job queue, like a CI runner. Indeed, it would be close to cml-runner; however, instead of using the SCM, it would use the Viewer UI to run experiments.

So this sounds much more in line with what we would like to have in DVC with regard to remote execution vs the current task runner implementation.

Basically, we would like to be able to start a machine that we can use to run some set of arbitrary jobs/commands on. DVC would handle syncing the appropriate data to some workspace directory on the remote machine. Then DVC (locally) would determine what should get run remotely, and queue the appropriate tasks via the agent. Once tasks are added to the queue, the agent would run them "unattended", and DVC (locally) would be able to request the status/results of those tasks as needed. The machine & agent would continue to stay up and wait for more tasks until DVC (locally) is told to destroy the remote resource.

So in this scenario, terraform/TPI would be used to create/destroy the terraform resource (machine + agent daemon), but data sync and the actual task (DVC pipeline) execution would be handled by DVC (via API calls made to the agent daemon).


Currently in DVC, we are prototyping a very basic & simplified way to do this ourselves using iterative_machine + SSH, without any daemon/server process running on the remote machine. But if a more full-featured agent/daemon is something that is actually needed by CML/the viewer/any other potential consumers, from the DVC perspective I think this would be much nicer (and higher priority) to have than the standalone executor/task runner.

@dberenbaum @dmpetrov

pmrowla avatar Dec 08 '21 07:12 pmrowla

@pmrowla Originally the task™ was conceptually just a plain executor, the agent being the more interesting piece. As conceived, the agent could run dvc.yaml exactly the same way a runner runs CI pipelines. However, to do so we should investigate how to achieve communication between the two parts. Having a middleware is something we have always been constrained by, the Viewer being our only product that has one.

DavidGOrtega avatar Dec 09 '21 10:12 DavidGOrtega

Right, I think having some kind of middleware that provides more granular control over task/job execution is what would fit DVC's needs best.

pmrowla avatar Dec 09 '21 11:12 pmrowla