ray icon indicating copy to clipboard operation
ray copied to clipboard

[cluster][doc] A tutorial for ML training with GPU on Kubernetes

Open kevin85421 opened this issue 3 years ago • 1 comments
trafficstars

Why are these changes needed?

This PR provides a guide for ML training with GPU on Kubernetes. The guide shows users how to build a Kubernetes cluster, create a ray cluster, and submit a PyTorch image training job. We will run Ray's PyTorch image training benchmark with a 1 gigabyte training set.

Related issue number

Checks

  • [x] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

kevin85421 avatar Sep 07 '22 20:09 kevin85421

cc @DmitriGekhtman

kevin85421 avatar Sep 08 '22 17:09 kevin85421

@ray-project/ray-docs could someone review? (Not sure if the above @ mention notifies the team members, but let's try.)

DmitriGekhtman avatar Sep 27 '22 01:09 DmitriGekhtman

I have some small suggestions. This is a good first cut.

To make this great going forward, we would ideally have: (a) instructions for AWS too and (b) have this tested with Ray Release tests. Otherwise there is a high chance it will break over time.

@pcmoritz Thank you for the recommendations! I have opened two issues to track the progress of (1) AWS instructions (2) Ray Release Tests.

kevin85421 avatar Sep 29 '22 18:09 kevin85421