Revising checkpoints workflow: decouple checkpoints from pipeline

Open • dberenbaum opened this issue 3 years ago • 1 comment

See #6104 for the full discussion (summary comment pasted below).

Here's the basic proposal of how I see this working:

  • Map initial dependency state to final output state in the run-cache.
  • Don't save to the run-cache until the experiment "completes."
  • Save the state of each checkpoint in the cache but not the run-cache.
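The three points above can be sketched as a toy model. This is only an illustration of the proposal, not DVC's implementation: `ToyStore`, `state_key`, and `train` are hypothetical names, and the hashing is a stand-in for whatever DVC actually does.

```python
import hashlib


def state_key(deps: dict) -> str:
    """Stable hash of a dependency state (toy stand-in, not DVC's hashing)."""
    canonical = repr(sorted(deps.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()


class ToyStore:
    """Hypothetical model of the proposal: a cache plus a separate run-cache."""

    def __init__(self):
        self.cache = {}      # content hash -> data; every checkpoint lands here
        self.run_cache = {}  # initial dependency state -> hash of final output

    def save_checkpoint(self, data: bytes) -> str:
        # Checkpoints go to the cache only, never to the run-cache.
        digest = hashlib.sha256(data).hexdigest()
        self.cache[digest] = data
        return digest

    def complete(self, deps: dict, final_digest: str) -> None:
        # Only a completed experiment gets a run-cache entry.
        self.run_cache[state_key(deps)] = final_digest

    def lookup(self, deps: dict):
        return self.run_cache.get(state_key(deps))


def train(store: ToyStore, deps: dict, epochs: int, fail_at=None):
    """Save one checkpoint per epoch; return None if interrupted at `fail_at`."""
    last = None
    for epoch in range(1, epochs + 1):
        if fail_at == epoch:
            return None  # interrupted: checkpoints are cached, run-cache untouched
        last = store.save_checkpoint(f"model@{epoch}".encode())
    store.complete(deps, last)
    return last
```

In this sketch, an interrupted run leaves its checkpoints in `cache` but adds nothing to `run_cache`, while a completed run records a single entry that maps the initial dependency state to the final output.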

If another user does a fresh clone of the repo, pulls the run-cache, and then tries to do dvc exp run with that same dependency state, is the expected result that the stage would be re-run (and 10 checkpoints would be generated)? Or is the expected result that DVC would use the run-cache entry and just generate a single exp commit containing the run-cached "post-10-epochs" output state?

The latter makes sense to me. This is what would happen with a non-checkpoint experiment, right? Ideally, if the user pulls the experiment in addition to the run-cache, then they would get a conflict and be referred back to the original experiment, right?

Will checkpoints still be resumable?

For interrupted experiments, they can behave the same as now, preferably resuming from the last checkpoint. "Completed" experiments would not be resumable.

If the user changes the dependency state, does DVC re-run the entire thing from scratch using the new dependency state, or does DVC resume from the result of the initial run?

For interrupted experiments, they can behave the same as now, resuming from the result of the initial run. Once the experiment completes successfully, then DVC should no longer resume that experiment and should start a new experiment from scratch if the dependency state changes (or refuse to run without -f if the dependency state has not changed).
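The resume rules described in the last two answers can be written down as a small decision function. This is a sketch of the proposed behavior only: `Action` and `decide` are hypothetical names, and the handling of a forced run on an unchanged, completed experiment (re-running from scratch) is an assumption not spelled out above.

```python
from enum import Enum


class Action(Enum):
    RESUME = "resume from the last checkpoint"
    RUN_FROM_SCRATCH = "start a new experiment from scratch"
    REFUSE = "refuse to run without -f"


def decide(completed: bool, has_checkpoints: bool,
           deps_changed: bool, force: bool = False) -> Action:
    """Pick what an experiment run should do under the proposal (hypothetical)."""
    if not completed:
        # Interrupted experiments behave as they do now: resume if possible,
        # even when the dependency state has changed.
        return Action.RESUME if has_checkpoints else Action.RUN_FROM_SCRATCH
    if deps_changed:
        # Completed experiments are never resumed; new state means a new run.
        return Action.RUN_FROM_SCRATCH
    # Completed and unchanged: only run again when forced (assumed behavior).
    return Action.RUN_FROM_SCRATCH if force else Action.REFUSE
```

For example, `decide(completed=True, has_checkpoints=True, deps_changed=False)` yields `Action.REFUSE`, while the same call with `completed=False` yields `Action.RESUME`.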

If I do dvc exp run -S epochs=20 from this state, does DVC start from scratch, and run 20 brand-new checkpoints?

Yes, since this would match the non-checkpoints behavior. If a user wants to extend experiments manually until finding the right number of epochs, I would suggest not setting epochs as a parameter since that implies a fixed, predetermined number of epochs.
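One way to see why a changed epochs parameter would start from scratch: if parameters are part of the dependency state that keys the run-cache, then epochs=20 produces a different key than epochs=10 and the lookup misses. The sketch below assumes exactly that; `dependency_key`, `lookup`, and the sample values are hypothetical and do not reflect DVC's real key format.

```python
import hashlib
import json


def dependency_key(params: dict, data_hash: str) -> str:
    """Toy run-cache key built from parameters plus a data hash (assumed layout)."""
    payload = json.dumps({"params": params, "data": data_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


# One completed experiment, recorded under its initial dependency state.
run_cache = {
    dependency_key({"epochs": 10}, "abc123"): "outputs-after-10-epochs",
}


def lookup(params: dict, data_hash: str):
    """Return the cached final output, or None on a run-cache miss."""
    return run_cache.get(dependency_key(params, data_hash))
```

Here `lookup({"epochs": 10}, "abc123")` finds the cached result, while `lookup({"epochs": 20}, "abc123")` misses, which corresponds to running 20 brand-new checkpoints.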

Originally posted by @dberenbaum in https://github.com/iterative/dvc/discussions/6104#discussioncomment-1888905

dberenbaum, Feb 14 '22 22:02

A concise summary of the request is to decouple checkpoints from the pipeline. Checkpoints should only be responsible for caching data and generating a Git ref, not for any pipeline execution.

dberenbaum, Sep 09 '22 13:09