Update v6e-256 KubeRay Sample
Why are these changes needed?
This PR adds recommended fields to the v6e-256 RayCluster and RayJob sample manifests.

For the larger slice size, adding `privileged: true` resolves the following error: `UNKNOWN: TPU initialization failed: open(/dev/vfio/vfio): No such file or directory: No such file or directory; Couldn't open vfio container /dev/vfio/vfio`.

Adding `resources: '"{\"TPU\": 4}"'` to the `rayStartParams` resolves a race condition that sometimes occurs in RayServices and RayJobs, where Python script execution begins before the Raylets have detected the TPU devices, causing `ray.available_resources()["TPU"]` to return 0.
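The race described above can also be guarded against from the driver script. The following is a minimal sketch, not part of this PR: `wait_for_resource` and all of its parameters are hypothetical names introduced for illustration. It polls a resource-reporting callable until the named resource reaches a minimum amount, and it takes the callable as an argument so the logic can be exercised without a running cluster.

```python
import time
from typing import Callable, Mapping


def wait_for_resource(
    get_resources: Callable[[], Mapping[str, float]],
    name: str = "TPU",
    minimum: float = 1.0,
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
) -> float:
    """Poll until resource `name` reaches `minimum`; return the observed amount.

    Hypothetical helper (an assumption, not a Ray or KubeRay API).
    Raises TimeoutError if the resource does not appear before the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # .get() also covers the case where the key is missing entirely,
        # which is treated the same as an amount of 0.
        amount = get_resources().get(name, 0.0)
        if amount >= minimum:
            return amount
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"resource {name!r} stayed below {minimum} for {timeout_s}s "
                f"(last seen: {amount})"
            )
        time.sleep(poll_interval_s)
```

In a Ray driver this might be called as `wait_for_resource(ray.cluster_resources, name="TPU", minimum=256)` after `ray.init()`, assuming the script expects the full slice before scheduling work. Setting the resource explicitly in `rayStartParams`, as this PR does, avoids the need for such a guard.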
This PR was manually tested as follows:
- Create the RayJob CR
```sh
kubectl apply -f https://raw.githubusercontent.com/ray-project/kuberay/master/ray-operator/config/samples/ray-job.tpu-v6e-256-multihost.yaml
```
- View the Job output:
```sh
kubectl logs -l=job-name=v6e-256-job
```
2024-10-23 02:37:52,871 INFO cli.py:39 -- Job submission server address: http://v6e-256-job-raycluster-xj4wj-head-svc.default.svc.cluster.local:8265
2024-10-23 02:37:53,716 SUCC cli.py:63 -- ----------------------------------------------
2024-10-23 02:37:53,716 SUCC cli.py:64 -- Job 'v6e-256-job-4mhms' submitted successfully
2024-10-23 02:37:53,716 SUCC cli.py:65 -- ----------------------------------------------
2024-10-23 02:37:53,716 INFO cli.py:289 -- Next steps
2024-10-23 02:37:53,716 INFO cli.py:290 -- Query the logs of the job:
2024-10-23 02:37:53,716 INFO cli.py:292 -- ray job logs v6e-256-job-4mhms
2024-10-23 02:37:53,716 INFO cli.py:294 -- Query the status of the job:
2024-10-23 02:37:53,716 INFO cli.py:296 -- ray job status v6e-256-job-4mhms
2024-10-23 02:37:53,716 INFO cli.py:298 -- Request the job to be stopped:
2024-10-23 02:37:53,716 INFO cli.py:300 -- ray job stop v6e-256-job-4mhms
2024-10-23 02:37:53,742 INFO cli.py:307 -- Tailing logs until the job exits (disable with --no-wait):
2024-10-23 02:37:53,447 INFO job_manager.py:528 -- Runtime env is setting up.
2024-10-23 02:38:14,855 INFO worker.py:1461 -- Using address 10.96.6.73:6379 set in the environment variable RAY_ADDRESS
2024-10-23 02:38:14,856 INFO worker.py:1601 -- Connecting to existing Ray cluster at address: 10.96.6.73:6379...
2024-10-23 02:38:14,870 INFO worker.py:1777 -- Connected to Ray cluster. View the dashboard at 10.96.6.73:8265
['TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256', 'TPU cores:256']
2024-10-23 02:38:44,974 SUCC cli.py:63 -- ---------------------------------
2024-10-23 02:38:44,974 SUCC cli.py:64 -- Job 'v6e-256-job-4mhms' succeeded
2024-10-23 02:38:44,974 SUCC cli.py:65 -- ---------------------------------
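The list printed by the job can be checked mechanically. The sketch below is illustrative only and rests on assumptions not stated in this PR: that the list holds one entry per worker host, that each host exposes 4 TPU chips (matching the `"TPU": 4` resource above), and that every worker reports the device count of the whole slice. Under those assumptions a v6e-256 slice should yield 256 / 4 = 64 entries, each reading `TPU cores:256`. The function name `check_tpu_report` and its parameters are hypothetical.

```python
import ast


def check_tpu_report(line: str, slice_chips: int = 256, chips_per_host: int = 4) -> bool:
    """Return True if `line` is a list with one matching entry per host.

    Hypothetical helper: `line` is the Python-list line copied from the
    job logs, for example "['TPU cores:256', 'TPU cores:256', ...]".
    """
    # literal_eval parses the printed list without executing arbitrary code.
    entries = ast.literal_eval(line.strip())
    expected_hosts = slice_chips // chips_per_host  # 256 // 4 == 64
    expected_entry = f"TPU cores:{slice_chips}"
    return len(entries) == expected_hosts and all(
        entry == expected_entry for entry in entries
    )
```

A result of `False` would suggest that some hosts did not join the cluster or did not see the full slice, which is the kind of symptom the `rayStartParams` change is meant to prevent.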
Related issue number
Checks
- [x] I've made sure the tests are passing.
- Testing Strategy
  - [x] Unit tests
  - [x] Manual tests
  - [ ] This PR is not tested :(