DVC feature requests
Collection of DVC issues which CML functionality needs
Potentially needs
- pulling & pushing cache (For syncing experiments. Any other reasons?)
  - https://github.com/iterative/dvc/issues/4268
  - retrieve plots from run-cache https://github.com/iterative/dvc/issues/4096
- `dvc verify` https://github.com/iterative/dvc/issues/5369
- `dulwich` auto-auth using CI config
- CI runner timeout
  - DVC handling SIGINT, SIGTERM, or SIGKILL mid-`exp run` and mid-checkpoint
    - often (always?) `dvc.lock` won't be generated https://github.com/iterative/dvc/issues/6180
  - run-cache storage (e.g. Azure https://github.com/iterative/dvc/issues/5899)
  - DVC needs to be aware of the total number of checkpoints expected per experiment https://github.com/iterative/dvc/issues/6183
  - `dvc exp run && dvc exp run` should only execute once
  - interrupting `dvc exp run` followed by calling `dvc exp run` again should resume (rather than start from checkpoint 0)
- pulling & pushing cache (For syncing experiments. Any other reasons?)
  - fetch experiment cache data https://github.com/iterative/dvc/issues/4649, https://github.com/iterative/dvc/issues/4268
  - `dvc exp push` for >50MB commits (e.g. somehow push to DVC remote rather than Git remote?) https://github.com/iterative/dvc/issues/6181
Needs
- CI runner timeout
  - [x] CML re-provisioning runner #208, #174
  - [ ] attached storage #161
  - [x] `dvc exp push` upon each checkpoint (e.g. via user callback? Or builtin option https://github.com/iterative/dvc/issues/6182?)
  - ~~DVC needs to be aware of the total number of checkpoints expected per experiment https://github.com/iterative/dvc/issues/6183~~
    - [ ] or just insist that the user code must work out the current checkpoint number by looking at the workspace state (see the sketch below)
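A minimal sketch of that last option, assuming checkpoints are written as `models/model-checkpoint-<N>` files (the naming scheme used by the `train.sh` examples further down; purely illustrative):

```bash
#!/bin/bash
# Work out the last completed checkpoint by inspecting the workspace,
# then resume training from the following step.
LAST=$(ls models/model-checkpoint-* 2>/dev/null \
  | sed 's/.*model-checkpoint-//' \
  | sort -n | tail -1)
NEXT=$(( ${LAST:-0} + 1 ))
echo "resuming training at checkpoint $NEXT"
```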
Naïve requirements (first edition)
Dulwich authentication
Commands like `dvc exp pull` and `dvc exp push` rely on authentication for interacting with the Git remote. In a headless setup like a continuous integration pipeline, these operations should not require any kind of user interaction, but `dulwich`, the library that provides DVC with Git capabilities, does not support many of the authentication hacks used by continuous integration tools.
GitHub Actions, as per the `actions/checkout@v2` action, relies on a custom authorization header set through the local repository configuration:

```
[http "https://github.com/"]
    extraheader = AUTHORIZATION: basic ···
```
GitLab and others may use different mechanisms, like SSH keys or credential helpers, so this would require further investigation. See https://github.com/dulwich/dulwich/issues/873 and https://github.com/dulwich/dulwich/issues/882 for similar requests.
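For reference, this is roughly how a CI step could read back the header stored by `actions/checkout@v2` and reuse it with a plain `git` invocation; a minimal sketch of a possible workaround, not something DVC or `dulwich` do today:

```bash
# Read the authorization header that actions/checkout@v2 stored in the
# local repository configuration ...
HEADER=$(git config --get 'http.https://github.com/.extraheader')
# ... and reuse it for an explicit git call outside of dulwich.
git -c "http.https://github.com/.extraheader=$HEADER" push origin HEAD
```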
Possible fixes
- Improve `dulwich` to support common authentication methods 😌
- Outsource push and pull operations to the `git` command-line tool 🙊
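A rough sketch of the second option, assuming DVC keeps experiment references under the `refs/exps/` namespace (an implementation detail; the refspec below is illustrative):

```bash
# Push and fetch experiment refs with the git CLI instead of dulwich, so the
# authentication already configured by the CI checkout step is reused.
git push origin 'refs/exps/*:refs/exps/*'
git fetch origin 'refs/exps/*:refs/exps/*'
```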
Automatic `dvc exp push` on checkpoints
In the spot instance and limited execution time scenarios, we need to provide users with a way of saving their checkpoints to their DVC remote and the experiment references to the Git remote each time a given number of checkpoints is captured.
Possible fixes
- Extend DVC to provide `dvc exp run --push-checkpoints <remote> --push-checkpoints-each=<count>` with a callback 😌
- ~~Watch the `$DVC_ROOT/.dvc/tmp/DVC_CHECKPOINT` file and trigger a push once it gets deleted~~ not even possible 🙊
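In the absence of such a flag, the user-callback variant could be approximated from inside the training script itself; a minimal sketch, assuming the script tracks its own checkpoint counter (`PUSH_EVERY` and the loop are illustrative, and whether these commands can safely run mid-`dvc exp run` is exactly the open question above):

```bash
#!/bin/bash
PUSH_EVERY=5  # push to the remotes every 5 checkpoints (illustrative)
for STEP in {1..100}; do
  # ... produce checkpoint $STEP here ...
  if (( STEP % PUSH_EVERY == 0 )); then
    dvc push             # checkpoint data to the DVC remote
    dvc exp push origin  # experiment reference to the Git remote
  fi
done
```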
Lazy experiment pull and apply
In order to keep CML workflows as simple as possible, we should probably abstract the differences between newly created tasks and resumed tasks. It would be interesting to have some DVC flags to avoid returning a non-zero exit code when the referenced experiment doesn't exist yet.
Possible fixes
- Extend DVC to provide `dvc exp pull --lazy <remote> <experiment>` and `dvc exp apply --lazy <experiment>` 🤔
- Use `dvc exp list <remote>` to determine if the experiment exists and only pull it in that case 🙊
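A sketch of how the second option could look in a CI script today; the experiment name is a placeholder, and parsing the `dvc exp list` output with `grep` is deliberately simplistic:

```bash
#!/bin/bash
EXPERIMENT="my-experiment"  # placeholder: deterministically derived from commit & branch
# Only pull and apply the experiment if the Git remote already knows about it;
# otherwise start a fresh run.
if dvc exp list origin | grep -q "$EXPERIMENT"; then
  dvc exp pull origin "$EXPERIMENT"
  dvc exp apply "$EXPERIMENT"
else
  dvc exp run --name "$EXPERIMENT"
fi
```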
All the suggestions above assume that we're going to use DVC experiments to track CML runs, and those experiments will be deterministically named after the commit & branch that triggered the run.
> Lazy experiment pull and apply

don't see much of a problem with e.g. `dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run`

> don't see much of a problem with e.g. `dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run`
Me neither, that's why I used the 🤔 emoji above. Nevertheless, we would be silently masking any error, even network ones, under the experiment not found case, and this might not be good in a continuous integration scenario where failing early is better than blindly using expensive resources.
CML workflow stoppage or the endless training problem
If the workflow stops, the CML runner should be able to restart the workflow and continue training from the last checkpoint.
We conduct a series of experiments assuming that training generates incremental checkpoints, as TensorFlow does. We could have saved the state the way many other frameworks do; however, the chosen method is simple to implement and makes it easier to grasp what is going on.
TensorFlow example checkpoints:

```python
saver.save(sess, 'my_model', global_step=1000)
```

which produces the files:

```
my_model-1000.index
my_model-1000.meta
my_model-1000.data-00000-of-00001
checkpoint
```
When
- timeout (GitHub: 3 h, 72 h with self-hosted runners)
- spot instance or cloud runner termination

To simulate this stoppage we set up a workflow timeout of 1 minute.
Expected
It should be possible to restart the workflow and continue training from the last checkpoint until completion.
Problem
With DVC as storage, every experiment needs to write `dvc.lock` at some point before the process dies. In some cases, like `repro` and `exp run`, `dvc.lock` is not accessible until the very end.
Trials
DVC repro
- Alters `dvc.lock` after `repro`, never before.

Implementation: Our training process tries to push the models folder into DVC with different strategies:

`dvc push`
```yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc repro
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push
```
dvc.yaml

```yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          cache: true
          persist: true
```
```bash
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc push
  fi
done
```
dvc commit & push

`dvc commit` is suggested by DVC itself. We knew in advance that this would not work, but it might be misleading for the user.
```yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc repro
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push
```
dvc.yaml

```yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          cache: true
          persist: true
```
```bash
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc add models
    dvc commit
    dvc push
  fi
done
```
dvc push --run-cache
.github/workflows/cml.yaml
```yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache || echo 'Forgive this :pray:'
          dvc repro --pull
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push --run-cache
```
dvc.yaml

```yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          cache: true
          persist: true
```
train.sh

```bash
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc push --run-cache
  fi
done
```
dvc commit & push --run-cache
```yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache || echo 'Forgive this :pray:'
          dvc repro --pull
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push --run-cache
```
dvc.yaml

```yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          cache: true
          persist: true
```
```bash
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc add models
    dvc commit
    dvc push --run-cache
  fi
done
```
Problems:
We only have one `dvc.lock` file, and it only appears after `repro` finishes. Hence, if the training is stopped before `dvc.lock` is committed, DVC cannot recover the state in the next run and restarts from zero. `cml-pr` is useless here.
DVC exp run
- Alters `dvc.lock` after the run, never before.

Implementation: This is just a variation of `dvc repro`.

Problems
Exactly as `repro`.

`dvc push`
```yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc exp run
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push
```
dvc.yaml

```yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          cache: true
          persist: true
```
```bash
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc push
  fi
done
```
DVC exp run checkpoints:
- ephemeral commit
- updates `dvc.lock`
- always resumes from the last checkpoint

`dvc push`
```yaml
name: train-my-model
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - name: cml
        shell: bash
        timeout-minutes: 1
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull || echo 'Forgive this :pray:'
          dvc exp run
          echo '## CML Report' > report.md
          cml-pr dvc.lock .gitignore >> report.md
          dvc push
```
dvc.yaml

```yaml
stages:
  mystage:
    cmd: ./train.sh
    deps:
      - train.sh
    outs:
      - models:
          checkpoint: true
```
```bash
#!/bin/bash
mkdir models || echo 'Folder models is ready'
for STEP in {1..3}
do
  MODELF="models/model-checkpoint-$STEP"
  if [[ ! -f $MODELF ]]; then
    echo "training step $STEP..."
    sleep 30
    echo "saving weights $STEP"
    echo "weights $RANDOM" > $MODELF
    echo 'dvc push'
    dvc push
    echo 'cml-pr'
    cml-pr '.gitignore' 'dvc.lock'
  fi
done
```
Problems
- Checkpoints always resume training from the last checkpoint. This will end up in endless training.
- Using `cml-pr` generates many PRs.

A plausible solution would be to merge the last PR, forcing the CI to restart and continue from there. While our simple script, which only checks for the existence of a few files, would succeed, a real scenario would end up in endless training if that check is not also done in the training code (see the sketch below).
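A minimal sketch of such a stop condition inside the training script, assuming a known total number of steps and the `models/model-checkpoint-<N>` naming used above (illustrative only):

```bash
#!/bin/bash
TOTAL_STEPS=3  # known total number of checkpoints for this experiment (illustrative)
# Exit immediately if the final checkpoint already exists, so a restarted
# workflow does not loop forever re-running a finished training.
if [[ -f "models/model-checkpoint-$TOTAL_STEPS" ]]; then
  echo "training already completed"
  exit 0
fi
```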
My 2 cents: I feel this issue https://github.com/iterative/dvc/issues/5369 is related to CML; I'd love to have the ability to check whether I need to repro the pipeline without having to spin up a self-hosted runner and pull the data.
SIGINT is not very effective when running `dvc exp run`; it has to be triggered several times.
> Lazy experiment pull and apply
>
> don't see much of a problem with e.g. `dvc exp pull <remote> <experiment> && dvc exp apply <experiment> || dvc exp run`

btw @0x2b3bfa0 would `dvc exp run --pull` satisfy you here? Or is there a different use case for `dvc exp {pull,apply} --lazy`?

> btw @0x2b3bfa0 would `dvc exp run --pull` satisfy you here? Or is there a different use case for `dvc exp {pull,apply} --lazy`?
If we use the run cache to save checkpoints, that would be much more elegant than my earlier suggestion.
potential workflow:

```bash
dvc exp run --name JOB_ID_UNCHANGED_UPON_KILL_AND_RESTART --pull
# ... (auto)kill via SIGINT after ~72h ...  # CML does this 5 min early
dvc exp push  # CML does this
dvc push      # CML does this
# ... CML restarts the workflow
```
better alternative:

```bash
dvc exp run --name CI_JOB_ID_UNCHANGED_UPON_KILL_AND_RESTART --pull --push-every-checkpoint
# ... (auto)kill via SIGINT after ~72h ...
# ... CML restarts the workflow
```
Note that using COMMIT_SHA instead of CI_JOB_ID might not work in cases where the experiment params are not stored in the commit (i.e. two job IDs with different params but the same commit SHA).
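A sketch of how such a name could be derived in a GitHub Actions job, assuming the CI exposes an identifier that stays stable across the kill-and-restart cycle (whether `GITHUB_RUN_ID` actually behaves that way depends on how CML restarts the workflow, so treat it as a placeholder; `--pull` is the flag proposed above, not necessarily a released one):

```bash
# Derive a deterministic experiment name from the CI run rather than the commit,
# so two runs with different params on the same commit do not collide.
EXP_NAME="ci-${GITHUB_RUN_ID:-local}"
dvc exp run --name "$EXP_NAME" --pull
```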
to be revisited