
DVC feature requests

casperdcl opened this issue 3 years ago • 9 comments

Collection of DVC issues that CML functionality depends on

Potentially needs

  • pulling & pushing cache (for syncing experiments; any other reasons?)
    • https://github.com/iterative/dvc/issues/4268
    • retrieve plots from run-cache https://github.com/iterative/dvc/issues/4096
    • fetch experiment cache data https://github.com/iterative/dvc/issues/4649
    • dvc exp push for >50MB commits (e.g. somehow push to the DVC remote rather than the Git remote?) https://github.com/iterative/dvc/issues/6181
  • dvc verify https://github.com/iterative/dvc/issues/5369
  • dulwich auto-auth using CI config
  • CI runner timeout
    • DVC handling SIGINT, SIGTERM, or SIGKILL mid-exp run and mid-checkpoint
      • often (always?) dvc.lock won't be generated https://github.com/iterative/dvc/issues/6180
    • run-cache storage (e.g. Azure https://github.com/iterative/dvc/issues/5899)
  • DVC needs to be aware of total number of checkpoints expected per experiment https://github.com/iterative/dvc/issues/6183
    • dvc exp run && dvc exp run should only execute once
    • interrupting dvc exp run followed by calling dvc exp run again should resume (rather than start from checkpoint 0)

Needs

  • CI runner timeout
    • [x] CML re-provisioning runner #208, #174
    • [ ] attached storage #161
    • [x] dvc exp push upon each checkpoint (e.g. via user callback? Or builtin option https://github.com/iterative/dvc/issues/6182?)
  • ~~DVC needs to be aware of total number of checkpoints expected per experiment https://github.com/iterative/dvc/issues/6183~~
    • [ ] or just insist that the user code must work out the current checkpoint number by looking at the workspace state
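
A minimal sketch of that last idea, where the training script derives the resume point from workspace state (paths and naming are illustrative only):

# find the highest existing checkpoint and resume from the next step
LAST=$(ls models/model-checkpoint-* 2>/dev/null | sed 's/.*-//' | sort -n | tail -1)
START=$(( ${LAST:-0} + 1 ))
echo "resuming from step $START"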

casperdcl · May 25 '21

Naïve requirements (first edition)

Dulwich authentication

Commands like dvc exp pull and dvc exp push rely on authentication to interact with the Git remote. In a headless setup like a continuous integration pipeline, these operations should not require any user interaction, but dulwich, the library that provides DVC with Git capabilities, does not support many of the authentication hacks used by continuous integration tools.

GitHub Actions, as per the actions/checkout@v2 action, relies on a custom authorization header set through the local repository configuration:

[http "https://github.com/"]
        extraheader = AUTHORIZATION: basic ···

GitLab and others may use different mechanisms, like SSH keys or credential helpers, so this would require further investigation. See https://github.com/dulwich/dulwich/issues/873 and https://github.com/dulwich/dulwich/issues/882 for similar requests.

Possible fixes

  • Improve dulwich to support common authentication methods 😌
  • Outsource push and pull operations to the git command-line tool 🙊
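
For the second option, a rough sketch of what shelling out could look like, assuming (per the config snippet above) that the token lives under http.https://github.com/.extraheader, and that DVC keeps experiment refs under refs/exps:

# actions/checkout stores its token in the local git config;
# the git CLI honours it automatically, unlike dulwich
git config --local --get 'http.https://github.com/.extraheader'
# push DVC experiment refs via the git CLI instead of dulwich
git push origin 'refs/exps/*:refs/exps/*'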

Automatic dvc exp push on checkpoints

In the spot instance and limited execution time scenarios, we need to provide users with a way of saving their checkpoints to the DVC remote, and the experiment references to the Git remote, every time a given number of checkpoints is captured.
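
For example, a user callback inside the training loop could look roughly like this (EXP_NAME, PUSH_EVERY and save_checkpoint are illustrative names, not existing DVC features):

save_checkpoint "$STEP"                      # hypothetical helper in the user's script
if [ $(( STEP % PUSH_EVERY )) -eq 0 ]; then
    dvc push                                 # checkpoint data to the DVC remote
    dvc exp push origin "$EXP_NAME" || true  # experiment refs to the Git remote
fi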

Possible fixes

Lazy experiment pull and apply

In order to keep CML workflows as simple as possible, we should probably abstract away the differences between newly created tasks and resumed tasks. It would be useful to have DVC flags that avoid returning a non-zero exit code when the referenced experiment doesn't exist yet.

Possible fixes

  • Extend DVC to provide dvc exp pull --lazy <remote> <experiment> and dvc exp apply --lazy <experiment> 🤔
  • Use dvc exp list <remote> to determine if the experiment exists and only pull it in that case 🙊

All the suggestions above assume that we're going to use DVC experiments to track CML runs, and those experiments will be deterministically named after the commit & branch that triggered the run.

0x2b3bfa0 · Jun 01 '21

Lazy experiment pull and apply

don't see much of a problem with e.g. dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run

casperdcl · Jun 01 '21

don't see much of a problem with e.g. dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run

Me neither; that's why I used the 🤔 emoji above. Nevertheless, we would be silently masking any error, even network ones, under the "experiment not found" case, and this might not be good in a continuous integration scenario where failing early is better than blindly using expensive resources.
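
Building on the dvc exp list suggestion above, a sketch that keeps the fail-early property while tolerating a missing experiment (EXP_NAME is assumed to be known up front):

# fail early if the Git remote is unreachable...
if ! EXPERIMENTS=$(dvc exp list origin); then
    echo 'cannot reach the Git remote' >&2
    exit 1
fi
# ...but tolerate the experiment not existing yet
if grep -q "$EXP_NAME" <<< "$EXPERIMENTS"; then
    dvc exp pull origin "$EXP_NAME"
    dvc exp apply "$EXP_NAME"
fi
dvc exp run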

0x2b3bfa0 · Jun 01 '21

CML workflow stoppage or the endless training problem

If the workflow stops, the CML runner should be able to restart the workflow and continue training from the last checkpoint.

We conduct a series of experiments assuming that training generates incremental checkpoints, as TensorFlow does. We could have saved the state as many other frameworks do; however, the chosen method is simple to implement and makes it easier to grasp what's going on.

Tensorflow example checkpoints

# TF1-style Saver API; writes the files listed below
saver.save(sess, 'my_model', global_step=1000)

my_model-1000.index
my_model-1000.meta
my_model-1000.data-00000-of-00001
checkpoint

When

  • timeout (GitHub: 3h/72h with self-hosted runners)
  • spot instance or cloud runner termination

To simulate this stoppage, we set a workflow timeout of 1 minute.

Expected

The workflow should be able to restart and continue training from the last checkpoint until completion.

Problem

With DVC as storage, every experiment needs to handle dvc.lock at some point before dying. In some cases, like repro and exp run, dvc.lock is not accessible until the very end.

Trials

DVC repro

  • Alters dvc.lock after repro, never before.

Implementation: our training process tries to push the models folder to the DVC remote using different strategies:

dvc push

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc repro

          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 

train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push
        fi
done

dvc commit & push

dvc commit is suggested by DVC itself. We knew in advance that this would not work, but it might be misleading for the user.

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc repro
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 

train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc add, commit & push'
            dvc add models
            dvc commit
            dvc push
        fi
done

dvc push --run-cache

.github/workflows/cml.yaml

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache || echo 'Forgive this :pray:'
          dvc repro --pull
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push --run-cache

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 

train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push --run-cache'
            dvc push --run-cache
        fi
done

dvc commit & push --run-cache

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache || echo 'Forgive this :pray:'
          dvc repro --pull
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push --run-cache

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 

train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc add, commit & push --run-cache'
            dvc add models
            dvc commit
            dvc push --run-cache
        fi
done

Problems: we only have one dvc.lock file, and it is written after repro. Hence, if training is stopped before dvc.lock is committed, DVC cannot recover the state in the next run and restarts from zero. cml-pr is useless here.

DVC exp run

  • Alters dvc.lock after repro, never before.

Implementation: This is just a variation of dvc repro.

Problems: exactly as with repro.

dvc push

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc exp run

          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        cache: true
        persist: true 

train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push
        fi
done

DVC exp run checkpoints:

  • ephemeral commit
  • updates dvc.lock
  • always resumes from the last checkpoint

dvc push

name: train-my-model

on: [push]

jobs:
  run:
    runs-on: [ubuntu-latest]
  
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - uses: iterative/setup-cml@v1

      - uses: iterative/setup-dvc@v1

      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc exp run
          
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push

dvc.yaml

stages:
  mystage:
    cmd: ./train.sh
    deps:
    - train.sh
    outs:
    - models: 
        checkpoint: true

train.sh

#!/bin/bash

mkdir models || echo 'Folder models is ready'

for STEP in {1..3}
    do
        MODELF="models/model-checkpoint-$STEP"
        if [[ ! -f $MODELF ]]; then
            echo "training step $STEP..."
            sleep 30

            echo "saving weights $STEP"
            echo "weights $RANDOM" > $MODELF

            echo 'dvc push'
            dvc push

            echo 'cml-pr'
            cml-pr '.gitignore' 'dvc.lock'
        fi
done

Problems

  • Checkpoints resume training from the last checkpoint. This will end up in endless training.
  • Using cml-pr generates many PRs

A plausible solution would be to merge the last PR, forcing the CI to restart and continue from there. While our simple script, which checks for the existence of several files, would succeed, a real scenario would end up in endless training if that check is not also done in the training code.
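
For reference, a completion guard of that kind could be as simple as the following at the top of train.sh (the final step count is hardcoded here purely for illustration):

#!/bin/bash

# stop the restart loop once the last expected checkpoint exists
if [[ -f models/model-checkpoint-3 ]]; then
    echo 'training already complete, nothing to do'
    exit 0
fi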

DavidGOrtega · Jun 01 '21

My 2 cents: I feel this issue https://github.com/iterative/dvc/issues/5369 is related to CML; I'd love to have the ability to check whether I need to repro the pipeline without having to spin up a self-hosted runner and pull the data.

courentin · Jun 04 '21

SIGINT is not very effective when running dvc exp run; it has to be triggered several times.

DavidGOrtega · Jun 11 '21

Lazy experiment pull and apply

don't see much of a problem with e.g. dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run

btw @0x2b3bfa0 would dvc exp run --pull satisfy you here? Or is there a different use case for dvc exp {pull,apply} --lazy?

casperdcl · Jun 15 '21

btw @0x2b3bfa0 would dvc exp run --pull satisfy you here? Or is there a different use case for dvc exp {pull,apply} --lazy?

If we use the run cache to save checkpoints, that would be much more elegant than my earlier suggestion.

0x2b3bfa0 · Jun 15 '21

potential workflow:

dvc exp run --name JOB_ID_UNCHANGED_UPON_KILL_AND_RESTART --pull
# ... (auto)kill via SIGINT after ~72h ... # CML does this 5 min early
dvc exp push # CML does this
dvc push # CML does this
# ... CML restarts the workflow

better alternative:

dvc exp run --name CI_JOB_ID_UNCHANGED_UPON_KILL_AND_RESTART --pull --push-every-checkpoint
# ... (auto)kill via SIGINT after ~72h ...
# ... CML restarts the workflow

Note that using COMMIT_SHA instead of CI_JOB_ID might not work in cases where the experiment params are not stored in the commit (i.e. two job IDs with different params but the same commit SHA).
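
On GitHub Actions, a name that is stable across restarts of the same run could come from GITHUB_RUN_ID, assuming it is preserved when a workflow is re-run (only the attempt counter changes):

# GITHUB_RUN_ID stays constant across re-runs of the same workflow,
# so the restarted job resumes the same experiment
dvc exp run --name "ci-${GITHUB_RUN_ID}" --pull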

casperdcl · Jun 16 '21

to be revisited

dacbd · Feb 17 '23