
[Docs][KubeRay] add example for distributed checkpointing with kuberay and gcsfuse

Open andrewsykim opened this issue 1 year ago • 1 comment

Why are these changes needed?

Add a guide on using KubeRay and GCSFuse for distributed checkpointing. The example referenced in this doc depends on https://github.com/ray-project/kuberay/pull/2107

Related issue number

Checks

  • [X] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

andrewsykim avatar Apr 30 '24 02:04 andrewsykim

@aslonnie I will review this one. I have already had an offline sync with Andrew today.

kevin85421 avatar May 01 '24 00:05 kevin85421

Did a full run-through of this example again:

$ gcloud container clusters create kuberay-gcsfuse \
    --addons GcsFuseCsiDriver \
    --cluster-version=1.29.4 \
    --location=us-east4-c \
    --machine-type=g2-standard-8 \
    --release-channel=rapid \
    --num-nodes=4 \
    --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
    --workload-pool=${PROJECT_ID}.svc.id.goog
...
...
kubeconfig entry generated for kuberay-gcsfuse.
NAME             LOCATION    MASTER_VERSION      MASTER_IP      MACHINE_TYPE   NODE_VERSION        NUM_NODES  STATUS
kuberay-gcsfuse  us-east4-c  1.29.4-gke.1670000  35.245.233.26  g2-standard-8  1.29.4-gke.1670000  4          RUNNING
$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME                                             GPU
gke-kuberay-gcsfuse-default-pool-22f68f06-4plm   1
gke-kuberay-gcsfuse-default-pool-22f68f06-5fs1   1
gke-kuberay-gcsfuse-default-pool-22f68f06-9fvl   1
gke-kuberay-gcsfuse-default-pool-22f68f06-bc38   1
$ helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
NAME: kuberay-operator
LAST DEPLOYED: Tue May 21 16:14:41 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
$ BUCKET=kuberay-gcsfuse-test
$ gcloud storage buckets create gs://$BUCKET
Creating gs://kuberay-gcsfuse-test/...
$ kubectl create serviceaccount pytorch-distributed-training
serviceaccount/pytorch-distributed-training created
$ PROJECT_ID=...
$ PROJECT_NUMBER=...
$ gcloud storage buckets add-iam-policy-binding gs://${BUCKET} \
    --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/default/sa/pytorch-distributed-training" \
    --role "roles/storage.objectUser"
...
$ curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-resnet-image-classifier/ray-job.pytorch-image-classifier.yaml
$ sed -i "s/GCS_BUCKET/$BUCKET/g" ray-job.pytorch-image-classifier.yaml
$ kubectl create -f ray-job.pytorch-image-classifier.yaml
$ kubectl get po 
NAME                                                      READY   STATUS    RESTARTS   AGE
kuberay-operator-7d7998bcdb-m9qz4                         1/1     Running   0          13m
lassifier-frxm9-raycluster-82gmf-worker-gpu-group-9mckk   2/2     Running   0          8m25s
lassifier-frxm9-raycluster-82gmf-worker-gpu-group-m76cg   2/2     Running   0          8m25s
lassifier-frxm9-raycluster-82gmf-worker-gpu-group-nkmmr   2/2     Running   0          8m25s
lassifier-frxm9-raycluster-82gmf-worker-gpu-group-z4hq5   2/2     Running   0          8m25s
orch-image-classifier-frxm9-raycluster-82gmf-head-vsdks   2/2     Running   0          8m25s
pytorch-image-classifier-frxm9-hhm6w                      1/1     Running   0          2m49s
$ kubectl logs -f pytorch-image-classifier-frxm9-hhm6w
...
...
Training finished iteration 10 at 2024-05-21 09:28:49. Total running time: 59s
╭─────────────────────────────────────────╮
│ Training result                         │
├─────────────────────────────────────────┤
│ checkpoint_dir_name   checkpoint_000009 │
│ time_this_iter_s                3.37964 │
│ time_total_s                   43.10054 │
│ training_iteration                   10 │
│ acc                             0.22876 │
│ loss                            0.07504 │
╰─────────────────────────────────────────╯
Training saved a checkpoint for iteration 10 at: (local)/mnt/cluster_storage/finetune-resnet/TorchTrainer_107e7_00000_0_2024-05-21_09-27-49/checkpoint_000009
2024-05-21 09:28:50,389 WARNING util.py:202 -- The `process_trial_save` operation took 0.804 s, which may be a performance bottleneck.

Training completed after 10 iterations at 2024-05-21 09:28:51. Total running time: 1min 1s
2024-05-21 09:28:53,599 WARNING experiment_state.py:323 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.

Result(
  metrics={'loss': 0.07503842508870792, 'acc': 0.22875816993464052},
  path='/mnt/cluster_storage/finetune-resnet/TorchTrainer_107e7_00000_0_2024-05-21_09-27-49',
  filesystem='local',
  checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune-resnet/TorchTrainer_107e7_00000_0_2024-05-21_09-27-49/checkpoint_000009)
)
(RayTrainWorker pid=778, ip=10.36.0.13) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune-resnet/TorchTrainer_107e7_00000_0_2024-05-21_09-27-49/checkpoint_000009) [repeated 3x across cluster]
2024-05-21 09:28:55,878 SUCC cli.py:60 -- ----------------------------------------------------
2024-05-21 09:28:55,878 SUCC cli.py:61 -- Job 'pytorch-image-classifier-frxm9-z952p' succeeded
2024-05-21 09:28:55,878 SUCC cli.py:62 -- ----------------------------------------------------

andrewsykim avatar May 21 '24 16:05 andrewsykim

After the first RayJob succeeded, I deleted it and created a new one. The new RayJob successfully resumed from the newest checkpoint. (Screenshot attached: 2024-05-21 at 1:55:59 PM.)
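
This resume behavior works because the experiment state lives on the GCSFuse mount and therefore survives deletion of the first RayJob. A minimal sketch of one way Ray Train can pick up the newest checkpoint in that situation; the actual sample may differ, and train_loop, scaling_config, and run_config here are placeholders rather than names from the sample:

from ray.train.torch import TorchTrainer

# The experiment directory sits on the GCSFuse-backed mount, so it outlives the RayJob.
experiment_path = "/mnt/cluster_storage/finetune-resnet"

if TorchTrainer.can_restore(experiment_path):
    # Resume from the newest checkpoint written by the previous RayJob.
    trainer = TorchTrainer.restore(experiment_path, train_loop_per_worker=train_loop)
else:
    # First run: start a fresh experiment under the same storage path.
    trainer = TorchTrainer(train_loop, scaling_config=scaling_config, run_config=run_config)

result = trainer.fit()

Inside a restored run, ray.train.get_checkpoint() returns that checkpoint to each worker so the model weights and epoch counter can be reloaded (a fuller sketch of the save path appears below).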

kevin85421 avatar May 21 '24 20:05 kevin85421

The training code is referenced at https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/pytorch-resnet-image-classifier/fine-tune-pytorch-resnet-image-classifier.py
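
For readers who don't open the linked file, here is a minimal sketch of the checkpointing pattern it follows, assuming Ray Train's standard report/get_checkpoint APIs; build_model, train_one_epoch, and the checkpoint.pt filename are illustrative and not taken from the sample:

import os
import tempfile

import torch
from ray import train
from ray.train import Checkpoint, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop(config):
    model = build_model()  # illustrative helper, not in the sample
    start_epoch = 0

    # On a restored run, Ray Train hands the latest checkpoint to each worker.
    checkpoint = train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "checkpoint.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["num_epochs"]):
        loss, acc = train_one_epoch(model)  # illustrative helper
        with tempfile.TemporaryDirectory() as tmp_dir:
            torch.save({"epoch": epoch, "model": model.state_dict()},
                       os.path.join(tmp_dir, "checkpoint.pt"))
            # Ray Train persists the reported checkpoint under RunConfig.storage_path,
            # which is the GCSFuse mount, so it ends up in the GCS bucket.
            train.report({"loss": loss, "acc": acc},
                         checkpoint=Checkpoint.from_directory(tmp_dir))


trainer = TorchTrainer(
    train_loop,
    train_loop_config={"num_epochs": 10},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    # /mnt/cluster_storage is the GCSFuse volume mounted into every Ray pod.
    run_config=RunConfig(storage_path="/mnt/cluster_storage", name="finetune-resnet"),
)
result = trainer.fit()

With storage_path on the GCSFuse mount, the checkpoint_00000N directories shown in the logs above map directly to objects in the bucket.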

There's also a link to it under the "Deploy the RayJob" section.

andrewsykim avatar May 25 '24 04:05 andrewsykim