codeflare-sdk
Added Ray Train & PyTorch Lightning demo
Issue link
What changes have been made
Added a demo notebook and Python script based on the Ray Train & PyTorch Lightning example provided by Ray.
Verification steps
Setup
Notebook server ODH/RHOAI/Local
- Clone this repository with `git clone https://github.com/project-codeflare/codeflare-sdk.git`
- Check out this PR's branch
- Run `pip install codeflare-sdk`
- Restart your notebook kernel
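After the install and kernel restart, a quick check in a notebook cell can confirm that the SDK is visible to the kernel. This is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Check the distribution name used in the `pip install` step above.
sdk_version = installed_version("codeflare-sdk")
if sdk_version is None:
    print("codeflare-sdk is not installed in this kernel's environment")
else:
    print(f"codeflare-sdk {sdk_version} is installed")
```

If the package is reported as missing right after installing, the kernel is probably using a different environment than the one `pip` installed into.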
Testing
Run through the entire demo notebook.
Test the minio and S3 persistent storage examples separately by following the comments in pytorch_lightning.py
A few things to note:
- You must have 1 GPU per worker/head pod
- It takes around 5 minutes to complete
- This PR should not be merged until PRs #530 and #563 are merged
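The GPU note above can be turned into a quick capacity check before starting the demo. This is only a sketch: it assumes a single head pod plus some number of worker pods, each needing exactly 1 GPU, and the function name `required_gpus` is hypothetical.

```python
def required_gpus(num_workers: int, gpus_per_pod: int = 1) -> int:
    """Total GPUs needed for one head pod plus `num_workers` worker pods.

    Assumes every pod (head and workers) requests `gpus_per_pod` GPUs,
    following the "1 GPU per worker/head pod" note.
    """
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    head_pods = 1  # assumption: a single head pod
    return (head_pods + num_workers) * gpus_per_pod


# Example: a cluster with 2 worker pods needs 3 GPUs in total.
print(required_gpus(2))
```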
Checks
- [ ] I've made sure the tests are passing.
- Testing Strategy
- [ ] Unit tests
- [x] Manual tests
- [ ] Testing is not required for this change
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: varshaprasad96
- ~~OWNERS~~ [varshaprasad96]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@varshaprasad96 The only changes needed would have been the addition of S3 or minio storage. Is that what you had to change?
> The only changes needed would have been the addition of S3 or minio storage. Is that what you had to change?
That, and I'm not exactly sure of the right steps to run these notebooks. I had to create a separate venv, install all the deps, and change the import references before I could run this. Is there something I missed while configuring my environment to reproduce the demos?
On RHOAI, you should be able to clone the repo and check out this PR's branch from a terminal in your workbench. You can then install the latest version of the SDK, since no SDK changes were made, only notebooks. After restarting the notebook kernel, you should be able to run the demo within the workbench. What steps did you follow?
I see! I had been using a ROSA cluster, manually installing the components (not through the OpenShift AI operator), and trying to run the examples. This seems similar to what you mentioned. Will check it out again!