[Docs][KubeRay] add example for distributed checkpointing with kuberay and gcsfuse
Why are these changes needed?
Add a guide on using KubeRay and GCSFuse for distributed checkpointing. The example referenced in this doc depends on https://github.com/ray-project/kuberay/pull/2107
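For reference, the sample manifest mounts the bucket into each Ray pod with the GKE GCS FUSE CSI driver and exposes it at `/mnt/cluster_storage` (that path is visible in the training logs below). A minimal sketch of the volume wiring, written as a heredoc so it stays copy-pasteable — the annotation and driver names come from the GKE GCSFuse CSI documentation, and `GCS_BUCKET` is a placeholder, so treat this as illustrative rather than the exact manifest from the PR:

```bash
# Sketch of the GCSFuse volume wiring the example relies on (assumptions:
# the GcsFuseCsiDriver GKE addon is enabled and GCS_BUCKET is your bucket).
cat > gcsfuse-snippet.yaml <<'EOF'
# Pod template annotation that makes GKE inject the GCSFuse sidecar:
#   metadata:
#     annotations:
#       gke-gcsfuse/volumes: "true"
volumes:
  - name: cluster-storage
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: GCS_BUCKET          # replaced by the sed step below
        mountOptions: "implicit-dirs"
# containers[*].volumeMounts:
#   - name: cluster-storage
#     mountPath: /mnt/cluster_storage
EOF
```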
Related issue number
Checks
- [X] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
@aslonnie I will review this one. I have already had an offline sync with Andrew today.
Did a full run-through of this example again:
$ gcloud container clusters create kuberay-gcsfuse --addons GcsFuseCsiDriver --cluster-version=1.29.4 --location=us-east4-c --machine-type=g2-standard-8 --release-channel=rapid --num-nodes=4 --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest --workload-pool=${PROJECT_ID}.svc.id.goog
...
...
kubeconfig entry generated for kuberay-gcsfuse.
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
kuberay-gcsfuse us-east4-c 1.29.4-gke.1670000 35.245.233.26 g2-standard-8 1.29.4-gke.1670000 4 RUNNING
$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME GPU
gke-kuberay-gcsfuse-default-pool-22f68f06-4plm 1
gke-kuberay-gcsfuse-default-pool-22f68f06-5fs1 1
gke-kuberay-gcsfuse-default-pool-22f68f06-9fvl 1
gke-kuberay-gcsfuse-default-pool-22f68f06-bc38 1
$ helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
NAME: kuberay-operator
LAST DEPLOYED: Tue May 21 16:14:41 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
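(The transcript assumes the KubeRay chart repo is already configured. If it isn't, the skipped step is the standard one from the KubeRay docs:)

```bash
# Assumed prerequisite, not shown in the transcript above.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
```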
$ BUCKET=kuberay-gcsfuse-test
$ gcloud storage buckets create gs://$BUCKET
Creating gs://kuberay-gcsfuse-test/...
$ kubectl create serviceaccount pytorch-distributed-training
serviceaccount/pytorch-distributed-training created
$ PROJECT_ID=...
$ PROJECT_NUMBER=...
$ gcloud storage buckets add-iam-policy-binding gs://${BUCKET} --member "principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/${PROJECT_ID}.svc.id.goog/subject/ns/default/sa/pytorch-distributed-training" --role "roles/storage.objectUser"
...
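As a sanity check (not part of the original run), you can confirm the binding landed on the bucket:

```bash
# Optional verification sketch: the roles/storage.objectUser binding for the
# Workload Identity principal should appear in the bucket's IAM policy.
gcloud storage buckets get-iam-policy gs://${BUCKET}
```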
$ curl -LO https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/pytorch-resnet-image-classifier/ray-job.pytorch-image-classifier.yaml
$ sed -i "s/GCS_BUCKET/$BUCKET/g" ray-job.pytorch-image-classifier.yaml
$ kubectl create -f ray-job.pytorch-image-classifier.yaml
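Before listing pods, it can also help to watch the RayJob resource itself; a sketch:

```bash
# Sketch: watch the RayJob resource; its status shows when the underlying
# cluster is deployed and when the job reaches a terminal state.
kubectl get rayjob -w
```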
$ kubectl get po
NAME READY STATUS RESTARTS AGE
kuberay-operator-7d7998bcdb-m9qz4 1/1 Running 0 13m
lassifier-frxm9-raycluster-82gmf-worker-gpu-group-9mckk 2/2 Running 0 8m25s
lassifier-frxm9-raycluster-82gmf-worker-gpu-group-m76cg 2/2 Running 0 8m25s
lassifier-frxm9-raycluster-82gmf-worker-gpu-group-nkmmr 2/2 Running 0 8m25s
lassifier-frxm9-raycluster-82gmf-worker-gpu-group-z4hq5 2/2 Running 0 8m25s
orch-image-classifier-frxm9-raycluster-82gmf-head-vsdks 2/2 Running 0 8m25s
pytorch-image-classifier-frxm9-hhm6w 1/1 Running 0 2m49s
$ kubectl logs -f pytorch-image-classifier-frxm9-hhm6w
...
...
Training finished iteration 10 at 2024-05-21 09:28:49. Total running time: 59s
╭─────────────────────────────────────────╮
│ Training result │
├─────────────────────────────────────────┤
│ checkpoint_dir_name checkpoint_000009 │
│ time_this_iter_s 3.37964 │
│ time_total_s 43.10054 │
│ training_iteration 10 │
│ acc 0.22876 │
│ loss 0.07504 │
╰─────────────────────────────────────────╯
Training saved a checkpoint for iteration 10 at: (local)/mnt/cluster_storage/finetune-resnet/TorchTrainer_107e7_00000_0_2024-05-21_09-27-49/checkpoint_000009
2024-05-21 09:28:50,389 WARNING util.py:202 -- The `process_trial_save` operation took 0.804 s, which may be a performance bottleneck.
Training completed after 10 iterations at 2024-05-21 09:28:51. Total running time: 1min 1s
2024-05-21 09:28:53,599 WARNING experiment_state.py:323 -- Experiment checkpoint syncing has been triggered multiple times in the last 30.0 seconds. A sync will be triggered whenever a trial has checkpointed more than `num_to_keep` times since last sync or if 300 seconds have passed since last sync. If you have set `num_to_keep` in your `CheckpointConfig`, consider increasing the checkpoint frequency or keeping more checkpoints. You can supress this warning by changing the `TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S` environment variable.
Result(
metrics={'loss': 0.07503842508870792, 'acc': 0.22875816993464052},
path='/mnt/cluster_storage/finetune-resnet/TorchTrainer_107e7_00000_0_2024-05-21_09-27-49',
filesystem='local',
checkpoint=Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune-resnet/TorchTrainer_107e7_00000_0_2024-05-21_09-27-49/checkpoint_000009)
)
(RayTrainWorker pid=778, ip=10.36.0.13) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/mnt/cluster_storage/finetune-resnet/TorchTrainer_107e7_00000_0_2024-05-21_09-27-49/checkpoint_000009) [repeated 3x across cluster]
2024-05-21 09:28:55,878 SUCC cli.py:60 -- ----------------------------------------------------
2024-05-21 09:28:55,878 SUCC cli.py:61 -- Job 'pytorch-image-classifier-frxm9-z952p' succeeded
2024-05-21 09:28:55,878 SUCC cli.py:62 -- ----------------------------------------------------
After the first RayJob succeeded, I deleted it and created a new one. The new RayJob successfully resumed from the latest checkpoint.
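To confirm the checkpoints actually landed in the bucket, you can list them directly; the `finetune-resnet` directory name is taken from the experiment paths in the logs above, so treat this as a sketch:

```bash
# Checkpoints written to /mnt/cluster_storage in the pods land under the
# mounted bucket, so they should be visible here after the first run.
gcloud storage ls -r gs://${BUCKET}/finetune-resnet/
```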
The training code is referenced in https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/pytorch-resnet-image-classifier/fine-tune-pytorch-resnet-image-classifier.py. There's also a link to it under the "Deploy the RayJob" section of the doc.
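For anyone reproducing this run, a teardown sketch using the same names as the commands above:

```bash
# Clean up the resources created during this run.
kubectl delete -f ray-job.pytorch-image-classifier.yaml
gcloud storage rm --recursive gs://${BUCKET}
gcloud container clusters delete kuberay-gcsfuse --location=us-east4-c
```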