ray
ray copied to clipboard
[cluster][doc] A tutorial for ML training with GPU on Kubernetes
Why are these changes needed?
This PR provides a guide for ML training with GPU on Kubernetes. The guide shows users how to build a Kubernetes cluster, create a ray cluster, and submit a PyTorch image training job. We will run Ray's PyTorch image training benchmark with a 1 gigabyte training set.
Related issue number
Checks
- [x] I've signed off every commit(by using the -s flag, i.e.,
git commit -s) in this PR. - [x] I've run
scripts/format.shto lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
cc @DmitriGekhtman
@ray-project/ray-docs could someone review? (Not sure if the above @ mention notifies the team members, but let's try.)
I have some small suggestions. This is a good first cut.
To make this great going forward, we would ideally have: (a) instructions for AWS too and (b) have this tested with Ray Release tests. Otherwise there is a high chance it will break over time.
@pcmoritz Thank you for the recommendations! I have opened two issues to track the progress of (1) AWS instructions (2) Ray Release Tests.