Feature exp run: Dryer resume within the CI

Open DavidGOrtega opened this issue 4 years ago • 3 comments

Issue

In CI, to resume training from preexisting checkpoints, we currently have to do something like:

EXP_NAME=cml-run-${GITHUB_SHA}
EXP_AVAIL=$(dvc exp pull --run-cache origin "$EXP_NAME" || echo '')
if [[ -z "$EXP_AVAIL" ]]; then
    # printf is used because bash's echo does not expand \n by default
    printf '############\nFirst Time\n############\n'
    dvc exp run -n "$EXP_NAME" --pull -v
else
    printf '############\nResuming\n############\n'
    dvc exp apply "$EXP_NAME"
    dvc exp run -v
fi

Would be nice if we had:

  • a flag with dvc exp run -n $EXP_NAME to be able to pull and apply

So it would become:

EXP_NAME=cml-run-${GITHUB_SHA}
dvc exp run -n $EXP_NAME --pull-apply -v
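Until such a flag exists, the same behaviour could be approximated with a small shell wrapper. This is only a sketch: `--pull-apply` is the flag proposed in this issue, not an existing dvc option, and `exp_run_pull_apply` and `DVC_BIN` are hypothetical names introduced here (the latter only so the dvc command can be swapped out when testing).

```shell
# Hypothetical wrapper emulating the proposed --pull-apply behaviour.
# DVC_BIN is an assumption of this sketch; it defaults to the real dvc binary.
DVC_BIN="${DVC_BIN:-dvc}"

exp_run_pull_apply() {
    exp_name="$1"
    shift
    # A non-zero exit status from "exp pull" is treated as "no previous experiment".
    if "$DVC_BIN" exp pull --run-cache origin "$exp_name" >/dev/null 2>&1; then
        echo "resuming $exp_name"
        "$DVC_BIN" exp apply "$exp_name" && "$DVC_BIN" exp run "$@"
    else
        echo "first run of $exp_name"
        "$DVC_BIN" exp run -n "$exp_name" --pull "$@"
    fi
}

# Usage (assumes dvc is installed and a remote named origin exists):
#   exp_run_pull_apply "cml-run-${GITHUB_SHA}" -v
```

The wrapper branches on the exit status of `dvc exp pull` rather than on its captured output, so it does not depend on what the command prints.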

Additional issue

Please note:

EXP_AVAIL=$(dvc exp pull --run-cache origin $EXP_NAME || echo '')

The `|| echo ''` fallback is needed because dvc exp pull --run-cache origin $EXP_NAME throws an error if no previous experiments are present.
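One way to handle that failure without aborting a script that runs under `set -e` is to record the exit status explicitly. The sketch below is a generic illustration; `status_of` is a hypothetical helper, not part of dvc.

```shell
# Hypothetical helper: run a command with its output silenced and print its
# exit status, so a failure does not abort a script running under `set -e`.
status_of() {
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    echo "$status"
}

# Usage (assumes dvc is installed and EXP_NAME is set):
#   if [ "$(status_of dvc exp pull --run-cache origin "$EXP_NAME")" -eq 0 ]; then
#       echo "previous experiment found"
#   fi
```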

DavidGOrtega, Oct 18 '21 11:10

Posting an old message before it gets lost. The upshot of auto-pulling checkpoints is that we need to:

  • exp pull && exp apply
  • exp run: specify an experiment name the first time but not when resuming

EXP_NAME=${BASE}-cml-run-${SHA} # similar convention as cml-pr

# Branch on the exit status of the pull; its output is discarded.
if dvc exp pull --run-cache origin "$EXP_NAME" &>/dev/null; then
  echo "# resuming interrupted experiment"
  dvc exp apply "$EXP_NAME"
  DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run ...
else
  echo "# first time running experiment"
  DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run -n "$EXP_NAME" ...
fi
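The naming convention above could be wrapped in a small helper. This is a sketch under assumptions: the thread does not define BASE or SHA, so treating them as a branch name and a commit hash is a guess, `make_exp_name` is a hypothetical function, and replacing "/" with "-" is a precaution chosen here rather than a documented dvc requirement.

```shell
# Hypothetical helper: build "<base>-cml-run-<sha>" from a branch name and a commit.
make_exp_name() {
    base="$1"
    sha="$2"
    # Replace slashes in the branch name as a precaution (assumption, see above).
    safe_base=$(printf '%s' "$base" | tr '/' '-')
    echo "${safe_base}-cml-run-${sha}"
}

# In GitHub Actions the inputs could come from default environment variables
# (again an assumption about what BASE and SHA stand for):
#   EXP_NAME=$(make_exp_name "$GITHUB_REF_NAME" "$GITHUB_SHA")
```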

casperdcl, Mar 19 '22 11:03

  • exp run: specify an experiment name first time but not when resuming

One minor note: You should be able to use dvc exp run -n $EXP_NAME even when resuming experiments. It succeeds but generates a warning like WARNING: Ignoring option '--name exp-c1734' for resumed experiment. Existing experiment name will be preserved instead.

With that in mind, the workflow can be something like:

EXP_NAME=${BASE}-cml-run-${SHA} # similar convention as cml-pr
dvc exp pull --run-cache origin "$EXP_NAME" || true
dvc exp apply "$EXP_NAME" || true
DVC_EXP_AUTO_PUSH=1 DVC_EXP_GIT_REMOTE=origin dvc exp run -n "$EXP_NAME" ...
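The environment variables in this workflow only reach dvc if they are exported or placed on the same command line as the command. The sketch below illustrates that general shell behaviour, using a child `sh` process as a stand-in for dvc.

```shell
# A plain assignment stays local to the current shell.
unset DVC_EXP_AUTO_PUSH
DVC_EXP_AUTO_PUSH=1
plain=$(sh -c 'echo "${DVC_EXP_AUTO_PUSH:-unset}"')

# An assignment prefixed to a command is passed to that command only.
unset DVC_EXP_AUTO_PUSH
prefixed=$(DVC_EXP_AUTO_PUSH=1 sh -c 'echo "${DVC_EXP_AUTO_PUSH:-unset}"')

# An exported variable is visible to every later child process.
export DVC_EXP_AUTO_PUSH=1
exported=$(sh -c 'echo "${DVC_EXP_AUTO_PUSH:-unset}"')
unset DVC_EXP_AUTO_PUSH

echo "plain=$plain prefixed=$prefixed exported=$exported"
# prints: plain=unset prefixed=1 exported=1
```

In a CI step, `export` is convenient when several dvc commands need the same settings, while the prefixed form keeps the setting scoped to a single command.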

dberenbaum, Mar 22 '22 15:03

also related: https://github.com/iterative/example-repos-dev/issues/83#issuecomment-1128824098

casperdcl, Jun 10 '22 13:06

Closing since checkpoints have been deprecated. For discussion about resuming experiments, see https://github.com/iterative/dvclive/issues/505.

dberenbaum, Dec 01 '23 13:12