Feature exp run: Dryer resume within the CI
Issue
In the CI, to be able to resume training from preexisting checkpoints, we have to do something like:
EXP_NAME=cml-run-${GITHUB_SHA}
EXP_AVAIL=$(dvc exp pull --run-cache origin $EXP_NAME || echo '')
if [[ -z "$EXP_AVAIL" ]]; then
  # printf is used because bash's echo does not expand \n by default
  printf '############\nFirst Time\n############\n'
  dvc exp run -n $EXP_NAME --pull -v
else
  printf '############\nResuming\n############\n'
  dvc exp apply $EXP_NAME
  dvc exp run -v
fi
It would be nice if we had:
- a flag for dvc exp run -n $EXP_NAME to be able to pull and apply
So it would become:
EXP_NAME=cml-run-${GITHUB_SHA}
dvc exp run -n $EXP_NAME --pull-apply -v
Additional issue
Please note:
EXP_AVAIL=$(dvc exp pull --run-cache origin $EXP_NAME || echo '')
The || echo '' fallback is there because dvc exp pull --run-cache origin $EXP_NAME throws an error if no previous experiments are present.
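Since the failure shows up as a non-zero exit status, the check can also branch on that status directly instead of on captured output. The following is only a sketch: the function names exp_is_available and exp_mode are hypothetical, and it assumes (as described above) that the pull fails when the experiment does not exist. Note that any other pull failure, such as a network error, would also be treated as "not available".

```shell
# Hypothetical helper: succeeds (exit status 0) only if the experiment could be pulled.
# Assumption: `dvc exp pull` exits non-zero when no such experiment exists on the remote.
exp_is_available() {
  dvc exp pull --run-cache origin "$1" >/dev/null 2>&1
}

# Hypothetical helper: prints "resume" or "first-run" for the given experiment name.
exp_mode() {
  if exp_is_available "$1"; then
    echo "resume"
  else
    echo "first-run"
  fi
}
```

Because the command runs as the condition of an if, its failure does not abort the step even when the CI shell runs with set -e.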
Posting an old message before it gets lost. The upshot of auto-pull checkpoints is that we need to:
- exp pull && exp apply
- exp run: specify an experiment name first time but not when resuming
EXP_NAME=${BASE}-cml-run-${SHA} # similar convention as cml-pr
if dvc exp pull --run-cache origin $EXP_NAME &>/dev/null; then
  echo "# resuming interrupted experiment"
  dvc exp apply $EXP_NAME
  DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run ...
else
  echo "# first time running experiment"
  DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run -n $EXP_NAME ...
fi
exp run: specify an experiment name first time but not when resuming
One minor note: You should be able to use dvc exp run -n $EXP_NAME even when resuming experiments. It succeeds but generates a warning like WARNING: Ignoring option '--name exp-c1734' for resumed experiment. Existing experiment name will be preserved instead.
With that in mind, the workflow can be something like:
EXP_NAME=${BASE}-cml-run-${SHA} # similar convention as cml-pr
dvc exp pull --run-cache origin $EXP_NAME || true
dvc exp apply $EXP_NAME || true
DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run -n $EXP_NAME ...
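For reuse across CI jobs, the simplified workflow could be wrapped in a small function. This is a sketch under assumptions: the name resume_or_start_exp is hypothetical, the remote defaults to origin, and extra arguments are passed through to dvc exp run unchanged. The two variables are prefixed to the command so that they land in dvc's environment; a bare VAR=1 on its own line only sets an unexported shell variable.

```shell
# Hypothetical wrapper: resume the named experiment if it exists, otherwise start it.
# Usage: resume_or_start_exp <exp-name> [extra `dvc exp run` arguments...]
resume_or_start_exp() {
  local exp_name="$1"
  shift
  # Assumption: fall back to "origin" when no Git remote is configured in the environment.
  local remote="${DVC_EXP_GIT_REMOTE:-origin}"

  # Both steps may fail on the first run, when there is nothing to pull or apply yet.
  dvc exp pull --run-cache "$remote" "$exp_name" || true
  dvc exp apply "$exp_name" || true

  # Prefixed assignments are placed in the environment of this one command only.
  DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE="$remote" dvc exp run -n "$exp_name" "$@"
}
```

A CI step would then call resume_or_start_exp "$EXP_NAME" -v. When resuming, the -n option is expected to be ignored with the warning quoted above.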
also related: https://github.com/iterative/example-repos-dev/issues/83#issuecomment-1128824098
Closing since checkpoints have been deprecated. For discussion about resuming experiments, see https://github.com/iterative/dvclive/issues/505.