cml.dev guide: GH resuming workflow

Add a self-hosted long-running example to https://cml.dev/doc/cml-with-dvc (or somewhere else)

GH action launches a "self-hosted" GCP/AWS runner using cml runner --reuse --labels=cml and probably --cloud-spot
GH action runs the rest of the workflow on that "self-hosted" runner using runs-on: [self-hosted, cml] and timeout-minutes: 50400
If the GH action is about to time out, CML will restart the workflow

i.e. https://cml.dev/doc/self-hosted-runners?tab=GitHub#allocating-cloud-compute-resources-with-cml
The key is requesting GH's maximum timeout-minutes: 50400 - this signals to CML to restart the workflow just before timeout.
Write code to cache results so that the restarted workflow reuses previous results (e.g. use https://dvc.org/doc/user-guide/experiment-management/checkpoints#caching-checkpoints and https://github.com/iterative/dvc/issues/6823)

Mar 19 '22 11:03 casperdcl

more musings (for cml runner --cloud-spot):

from pathlib import Path

import dvclive

# Model, X and Y are placeholders for your own model class and data
live = dvclive.Live(resume=True)  # continue step counting from the previous run
model = Model(load="model.pkl" if Path("model.pkl").exists() else None)
while (epoch := live.get_step()) < 100:
    history = model.fit(X, Y, epochs=1)
    if epoch % 10 == 0:  # at most 10 epochs are lost upon CML respawning a spot instance
        model.save("model.pkl")
    live.log("loss", history['loss'])
    live.next_step()
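
For trying the resume pattern without DVCLive or a real model, here is a rough stdlib-only sketch. Everything in it is an assumption for illustration: the `save_checkpoint`, `load_checkpoint` and `train` helpers, the JSON checkpoint layout, and the `stop_at` argument (which only simulates a spot instance being killed) are hypothetical and not part of CML or DVC. It stores the step counter together with the "weights" in one file and replaces that file atomically, so a restarted workflow never sees a half-written checkpoint or a step count that is ahead of the saved model.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state to a temp file, then atomically swap it into place."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic when tmp and path share a filesystem


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh state if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0, "weights": 0.0}


def train(path: Path, total_steps: int = 100, save_every: int = 10, stop_at=None) -> dict:
    """Resume from the last checkpoint; stop_at simulates a preemption."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] == stop_at:
            raise RuntimeError("simulated spot-instance preemption")
        state["weights"] += 1.0  # stand-in for one epoch of model.fit
        state["step"] += 1
        if state["step"] % save_every == 0:  # at most save_every steps are lost
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # persist the final state
    return state
```

Calling `train(Path("ckpt.json"))` again after an interruption picks up from the last saved multiple of `save_every` instead of from step 0. In a real workflow the checkpoint file would also need to live somewhere that survives the instance (e.g. pushed to remote storage), which this sketch does not cover.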

Apr 01 '22 15:04 casperdcl

Out of curiosity, what makes this p1? Perhaps there are lots of support cases that could be avoided by or redirected to this? Thanks

Sep 29 '22 18:09 jorgeorpinel

lots of support requests over YEARS; super overdue.

Oct 03 '22 12:10 casperdcl

deprioritized and frozen. Removing from CML project board for now

Apr 13 '23 01:04 omesser