cml.dev
guide: GH resuming workflow
Add a self-hosted long-running example to https://cml.dev/doc/cml-with-dvc (or somewhere else)
- GH action launches "self-hosted" GCP/AWS using `cml runner --reuse --labels=cml` and probably `--cloud-spot`
- GH action runs the rest of the workflow on the "self-hosted" runner using `runs-on: [self-hosted, cml]` and `timeout-minutes: 50400`
- If GH action is about to timeout, CML will restart the workflow
  - i.e. https://cml.dev/doc/self-hosted-runners?tab=GitHub#allocating-cloud-compute-resources-with-cml
  - The key is requesting GH's maximum `timeout-minutes: 50400`; this signals to CML to restart the workflow just before timeout.
- write code to cache results so that the restarted workflow will use previous results (e.g. use https://dvc.org/doc/user-guide/experiment-management/checkpoints#caching-checkpoints and https://github.com/iterative/dvc/issues/6823)
more musings (for `cml runner --cloud-spot`):

    from pathlib import Path

    import dvclive

    # Model, X and Y are placeholders for the user's own model class and training data
    live = dvclive.Live(resume=True)  # continue step counting from the previous run
    model = Model(load="model.pkl" if Path("model.pkl").exists() else None)
    while (epoch := live.get_step()) < 100:
        history = model.fit(X, Y, epochs=1)
        if epoch % 10 == 0:  # at most 10 epochs are lost upon CML respawning a spot instance
            model.save("model.pkl")
        live.log("loss", history['loss'])
        live.next_step()
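
For anyone wanting to try the resume pattern without a model or DVCLive installed, here is a rough stdlib-only sketch of the same idea. Everything in it is made up for illustration: the `Preempted` exception, the `save_checkpoint`/`load_checkpoint`/`train` helpers, the JSON checkpoint format, and the fake loss. The one addition over the snippet above is that the checkpoint is written to a temporary file and renamed into place, on the assumption that a spot instance can be reclaimed in the middle of a write.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


class Preempted(RuntimeError):
    """Hypothetical stand-in for the runner being reclaimed mid-training."""


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the state to a temp file, then rename it over the checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # rename within one directory, so no half-written file


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh state if there is no checkpoint yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0, "loss": None}


def train(path: Path, total_epochs: int = 100, save_every: int = 10,
          die_at: Optional[int] = None) -> dict:
    """Run or resume training; die_at simulates an interruption at that epoch."""
    state = load_checkpoint(path)
    while state["epoch"] < total_epochs:
        if die_at is not None and state["epoch"] == die_at:
            raise Preempted(f"interrupted at epoch {die_at}")
        state["loss"] = 1.0 / (state["epoch"] + 1)  # placeholder for model.fit
        state["epoch"] += 1
        if state["epoch"] % save_every == 0:  # at most save_every epochs are lost
            save_checkpoint(path, state)
    return state
```

Calling `train(path, die_at=25)` raises `Preempted` after checkpoints at epochs 10 and 20; calling `train(path)` afterwards picks up from epoch 20 rather than 0. In a real workflow the checkpoint would presumably also need to live somewhere that survives the instance (remote storage or a DVC cache, for example), which this sketch does not cover.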
Out of curiosity, what makes this p1? Perhaps there are lots of support cases that could be avoided by or redirected to this? Thanks
lots of support requests over YEARS; super overdue.
deprioritized and frozen. Removing from CML project board for now