Issue link

What changes have been made

Added a demo notebook and Python script based on the Ray Train and PyTorch Lightning example provided by Ray.

Verification steps

Setup

Notebook server ODH/RHOAI/Local

Clone this repository with git clone https://github.com/project-codeflare/codeflare-sdk.git
Checkout this PR's branch
Run pip install codeflare-sdk
Restart your notebook kernel
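
After restarting the kernel, it can help to confirm that the SDK is actually visible to the environment the notebook is using. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of codeflare-sdk.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "codeflare-sdk" is the distribution name used in the pip install step above.
    sdk_version = installed_version("codeflare-sdk")
    if sdk_version is None:
        print("codeflare-sdk is not installed in this kernel's environment")
    else:
        print(f"codeflare-sdk {sdk_version} is installed")
```

If the check reports that the package is missing, the kernel is probably running in a different environment from the one where `pip install` was executed.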

Testing

Run through the entire demo notebook. Then test the MinIO and S3 persistent storage examples separately by following the comments in pytorch_lightning.py.
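
For orientation, the difference between the two storage variants can be sketched as below. This is an illustration under stated assumptions and is not taken from pytorch_lightning.py: the helper `storage_settings`, the bucket name, the prefix, and the endpoint are all hypothetical. The sketch only builds plain settings; in Ray Train these would typically feed `RunConfig(storage_path=...)`, and for an S3-compatible store such as MinIO, a `pyarrow.fs.S3FileSystem(endpoint_override=...)` passed as `storage_filesystem`.

```python
from typing import Dict, Optional


def storage_settings(
    kind: str, bucket: str, prefix: str, endpoint: Optional[str] = None
) -> Dict[str, str]:
    """Build illustrative storage settings for an S3 or MinIO run.

    Hypothetical helper: it returns plain strings and does not contact any service.
    """
    if kind == "s3":
        # AWS S3 can be addressed directly with an s3:// URI.
        return {"storage_path": f"s3://{bucket}/{prefix}"}
    if kind == "minio":
        if not endpoint:
            raise ValueError("a MinIO endpoint is required")
        # With a custom filesystem, the path is given without a URI scheme,
        # and the endpoint is supplied to the filesystem object instead.
        return {
            "storage_path": f"{bucket}/{prefix}",
            "endpoint_override": endpoint,
        }
    raise ValueError(f"unknown storage kind: {kind!r}")


if __name__ == "__main__":
    # All values below are placeholders, not values from the demo.
    print(storage_settings("s3", "my-bucket", "ray-results"))
    print(storage_settings("minio", "my-bucket", "ray-results", "minio.example:9000"))
```

Keeping the two cases in one function makes it clear that only the addressing changes between the S3 and MinIO tests; credentials are deliberately left out of the sketch.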

A few things to note:

The head pod and each worker pod need 1 GPU each
The demo takes around 5 minutes to complete
This PR should not be merged until PRs #530 and #563 are merged
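
As a quick sanity check on the GPU requirement above, the total number of GPUs the cluster must provide can be computed as follows. This is a small sketch assuming a single head pod; the function name `total_gpus_required` is hypothetical.

```python
def total_gpus_required(num_workers: int, gpus_per_pod: int = 1) -> int:
    """Total GPUs needed when the head pod and every worker pod get the same GPU count."""
    if num_workers < 0 or gpus_per_pod < 0:
        raise ValueError("num_workers and gpus_per_pod must be non-negative")
    head_pods = 1  # assumption: one head pod per Ray cluster
    return (num_workers + head_pods) * gpus_per_pod


if __name__ == "__main__":
    # Example: two workers plus the head pod, 1 GPU each.
    print(total_gpus_required(2))
```

For example, a cluster with two workers needs three GPUs in total under this requirement.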

Checks

- [ ] I've made sure the tests are passing.
Testing Strategy
- [ ] Unit tests
- [x] Manual tests
- [ ] Testing is not required for this change

Jun 10 '24 14:06 Bobbins228

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: varshaprasad96

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [varshaprasad96]

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

Jul 08 '24 17:07 openshift-ci[bot]

@varshaprasad96 The only changes needed would have been the addition of S3 or minio storage. Is that what you had to change?

Jul 09 '24 08:07 Bobbins228

> The only changes needed would have been the addition of S3 or minio storage. Is that what you had to change?

That, and I'm not sure of the right steps for running these notebooks. I had to create a separate venv, install all the dependencies, and change the import references to run this. Was I missing something in my configuration that is needed to reproduce the demos?

Jul 09 '24 09:07 varshaprasad96

On RHOAI, you should be able to clone the repo and check out this PR's branch from a terminal in your workbench. You can then install the latest version of the SDK, since this PR changes only the notebooks and not the SDK itself. After restarting the notebook kernel, you should be able to run the demo within the workbench. What steps did you follow?

Jul 09 '24 09:07 Bobbins228

I see! I had been using a ROSA cluster, installing the components manually (not through the OpenShift AI operator), and trying to run the examples. This seems similar to what you mentioned. Will check it out again!

Jul 09 '24 12:07 varshaprasad96


Added Ray Train & Pytorch Lightning demo
