[Feature] Checkpoint API to recover from checkpoint from previous runs

Open sathyanarays opened this issue 1 year ago • 1 comments

Search before asking

[X] I have searched the issues and found no similar feature request.

Description

Steps to reproduce

The Ray training frameworks include examples that illustrate checkpointing and recovering from a checkpoint. One such example shows how to configure checkpointing for a PyTorch training job.

1. Trigger the training RayJob

kubectl apply -f rayjob.yaml

2. Kill the head pod

Let the training job make a couple of checkpoints and then kill the head pod.

kubectl delete pods rayjob-sample-raycluster-lv85g-head-tbfwq

3. The new driver ignores the checkpoint

The current driver pod errors out and a new driver pod is created. The new driver pod runs the training job again from scratch, ignoring the checkpoints produced by the previous run.

Hacky Fix

To work around this problem, we have to write a function whose logic is tightly coupled to the checkpoint storage layout. For example, see the function findLatestCheckpoint in this job definition.
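For readers without access to the linked job definition, here is a minimal sketch of what such a helper might look like. It is not the author's findLatestCheckpoint. The function name find_latest_checkpoint, the run_dir parameter, and the assumption that checkpoints are stored as directories named checkpoint_<number> somewhere under the run's storage directory are all illustrative. The dependence on that directory naming is exactly the tight coupling described above.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed layout: checkpoint directories are named "checkpoint_<digits>".
_CHECKPOINT_DIR = re.compile(r"^checkpoint_(\d+)$")


def find_latest_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the highest-numbered checkpoint directory under run_dir.

    Hypothetical helper: returns None when run_dir does not exist or
    contains no directory matching the assumed naming scheme.
    """
    root = Path(run_dir)
    if not root.is_dir():
        return None

    best_key, best_path = None, None
    for path in root.rglob("checkpoint_*"):
        match = _CHECKPOINT_DIR.match(path.name)
        if match is None or not path.is_dir():
            continue  # skip files and unrelated names
        # Order by checkpoint index first, modification time as a tie-break.
        key = (int(match.group(1)), path.stat().st_mtime)
        if best_key is None or key > best_key:
            best_key, best_path = key, path
    return best_path
```

A driver script could call this at startup and, if the result is not None, load the model and optimizer state from that directory before training resumes. This only works when the checkpoints live on storage that outlives the head pod, such as a shared volume.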

Use case

It would be great to have an API that we can call to get the location of the latest checkpoint from the previous iteration of a given run.

Related issues

No response

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

May 17 '24 12:05 sathyanarays

What is the relationship between this issue and KubeRay? It seems like a Ray Train issue.

May 18 '24 17:05 kevin85421

Closing as this is covered by https://docs.ray.io/en/latest/train/user-guides/fault-tolerance.html#auto-resume

Jun 03 '24 06:06 sathyanarays
