
guide: GH resuming workflow

Open casperdcl opened this issue 3 years ago • 4 comments

Add a self-hosted long-running example to https://cml.dev/doc/cml-with-dvc (or somewhere else)

  1. GH action launches a "self-hosted" GCP/AWS runner using cml runner --reuse --labels=cml and probably --cloud-spot
  2. GH action runs the rest of the workflow on the "self-hosted" runner using runs-on: [self-hosted, cml] and timeout-minutes: 50400
  3. If the GH action is about to time out, CML will restart the workflow
  • i.e. https://cml.dev/doc/self-hosted-runners?tab=GitHub#allocating-cloud-compute-resources-with-cml
  • The key is requesting GH's maximum timeout-minutes: 50400 (35 days) - this signals to CML to restart the workflow just before the timeout.
  • Write code to cache results so that the restarted workflow reuses previous results (e.g. use https://dvc.org/doc/user-guide/experiment-management/checkpoints#caching-checkpoints and https://github.com/iterative/dvc/issues/6823)

casperdcl avatar Mar 19 '22 11:03 casperdcl

more musings (for cml runner --cloud-spot):

from pathlib import Path

import dvclive

# Model, X and Y are placeholders for the user's own model and data
live = dvclive.Live(resume=True)  # resume=True picks up the step count from existing logs
model = Model(load="model.pkl" if Path("model.pkl").exists() else None)
while (epoch := live.get_step()) < 100:
    history = model.fit(X, Y, epochs=1)
    if epoch % 10 == 0:  # at most 10 epochs are lost upon CML respawning a spot instance
        model.save("model.pkl")
    live.log("loss", history['loss'])
    live.next_step()
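
For illustration only, the same resume pattern can be sketched with the standard library alone, so that it runs without DVCLive or a real model. Everything here is hypothetical and not part of CML or DVCLive: the names load_state, save_state and train, the JSON checkpoint file, and the stop_after argument, which merely simulates a spot instance being terminated. The sketch assumes the epoch counter and the model state are saved together, so a restarted run continues from exactly what was last written to disk.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0, "weight": 0.0}


def save_state(path: Path, state: dict) -> None:
    """Write the checkpoint to a temp file, then rename it over the old one,
    so an interruption cannot leave a half-written checkpoint behind."""
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def train(path: Path, total_epochs: int = 100, save_every: int = 10,
          stop_after=None) -> dict:
    """Toy training loop. `stop_after` (hypothetical) simulates a preemption
    after that many epochs in the current run."""
    state = load_state(path)
    done_this_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and done_this_run >= stop_after:
            break  # simulated preemption: progress since the last save is lost
        state["weight"] += 0.5  # stand-in for one epoch of model.fit
        state["epoch"] += 1
        done_this_run += 1
        if state["epoch"] % save_every == 0:
            save_state(path, state)
    return state
```

Calling train(path, stop_after=25) and then train(path) on the same file mimics a run that is killed at epoch 25 and restarted: the second call picks up from the checkpoint written at epoch 20, so only the epochs since the last save are repeated.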

casperdcl avatar Apr 01 '22 15:04 casperdcl

Out of curiosity, what makes this p1? Perhaps there are lots of support cases that could be avoided by or redirected to this? Thanks

jorgeorpinel avatar Sep 29 '22 18:09 jorgeorpinel

lots of support requests over YEARS; super overdue.

casperdcl avatar Oct 03 '22 12:10 casperdcl

deprioritized and frozen. Removing from CML project board for now

omesser avatar Apr 13 '23 01:04 omesser