Added Ray Train & PyTorch Lightning demo

Open Bobbins228 opened this issue 1 year ago • 5 comments

Issue link

RHOAIENG-7805

What changes have been made

Added a demo notebook and Python script based on the Ray Train & PyTorch Lightning example provided by Ray.

Verification steps

Setup

Notebook server ODH/RHOAI/Local

  • Clone this repository with git clone https://github.com/project-codeflare/codeflare-sdk.git
  • Check out this PR's branch
  • Run pip install codeflare-sdk
  • Restart your notebook kernel
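
After restarting the kernel, it can help to confirm that the SDK is actually visible to the notebook's environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of codeflare-sdk.

```python
from importlib import metadata


def installed_version(package: str = "codeflare-sdk"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("codeflare-sdk is not installed in this kernel's environment")
    else:
        print(f"codeflare-sdk {version} is installed")
```

If this reports that the package is missing, the pip install most likely ran against a different Python environment than the one the kernel uses.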

Testing

Run through the entire demo notebook. Test the MinIO and S3 persistent storage examples separately by following the comments in pytorch_lightning.py.
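
For orientation, here is a rough sketch of how Ray Train can be pointed at S3-compatible storage. It is not taken from pytorch_lightning.py, whose comments remain the authoritative instructions. The environment variable names AWS_S3_ENDPOINT and AWS_S3_BUCKET, the `ray-results` prefix, and both helper names are assumptions made for this example; `pyarrow.fs.S3FileSystem` and `ray.train.RunConfig` are the library calls it relies on.

```python
import os


def s3_settings(env=None):
    """Collect S3-compatible storage settings from environment variables.

    AWS_S3_ENDPOINT and AWS_S3_BUCKET are assumed names for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "access_key": env.get("AWS_ACCESS_KEY_ID"),
        "secret_key": env.get("AWS_SECRET_ACCESS_KEY"),
        # Set this for MinIO; leave it unset to use the default AWS endpoint.
        "endpoint_override": env.get("AWS_S3_ENDPOINT"),
        "bucket": env.get("AWS_S3_BUCKET"),
    }


def build_run_config(settings, prefix="ray-results"):
    """Build a Ray Train RunConfig that persists results to the bucket."""
    if not settings.get("bucket"):
        raise ValueError("a bucket name is required")

    # Imported lazily so s3_settings can be used without Ray or PyArrow installed.
    import pyarrow.fs
    from ray.train import RunConfig

    fs = pyarrow.fs.S3FileSystem(
        access_key=settings["access_key"],
        secret_key=settings["secret_key"],
        endpoint_override=settings["endpoint_override"],
    )
    # With a custom filesystem, storage_path is "bucket/prefix" without "s3://".
    return RunConfig(
        storage_path=f"{settings['bucket']}/{prefix}",
        storage_filesystem=fs,
    )
```

The resulting object would be passed as `run_config` to the trainer. The only difference between the MinIO and S3 cases in this sketch is whether an endpoint override is set.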

A few things to note:

  • You must have 1 GPU per worker/head pod
  • The demo takes around 5 minutes to complete
  • This PR should not be merged until PRs #530 and #563 are merged
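
On the Ray Train side, the one-GPU-per-worker requirement is typically expressed through the scaling configuration. The sketch below assumes the training workers map one-to-one onto GPU-equipped pods; the helper names are hypothetical, and the demo's actual settings may differ.

```python
def scaling_kwargs(num_workers: int) -> dict:
    """Keyword arguments that request exactly one GPU for each training worker."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return {
        "num_workers": num_workers,
        "use_gpu": True,
        "resources_per_worker": {"GPU": 1},
    }


def build_scaling_config(num_workers: int):
    """Create a ray.train.ScalingConfig from the arguments above."""
    # Imported lazily so scaling_kwargs can be checked without Ray installed.
    from ray.train import ScalingConfig

    return ScalingConfig(**scaling_kwargs(num_workers))
```

If the cluster has fewer GPUs than the workers request, the training job would wait for resources instead of starting, which is worth ruling out before assuming the demo is hung.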

Checks

  • [ ] I've made sure the tests are passing.
  • Testing Strategy
    • [ ] Unit tests
    • [x] Manual tests
    • [ ] Testing is not required for this change

Bobbins228, Jun 10 '24 14:06

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: varshaprasad96

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot], Jul 08 '24 17:07

@varshaprasad96 The only changes needed would have been the addition of S3 or minio storage. Is that what you had to change?

Bobbins228, Jul 09 '24 08:07

The only changes needed would have been the addition of S3 or minio storage. Is that what you had to change?

That, and I'm not exactly sure of the right steps to run these notebooks. I had to create a separate venv, install all the deps, change the import references, and then run it. Is there something I was missing in my configuration that would let me reproduce the demos?

varshaprasad96, Jul 09 '24 09:07

On RHOAI, in your workbench, you should be able to clone the repo and check out this PR's branch via a terminal. You can then install the latest version of the SDK, as no SDK changes were made, only notebooks. After restarting the notebook kernel you should be able to run the demo within the workbench. What steps did you follow?

Bobbins228, Jul 09 '24 09:07

I see! I had been using a ROSA cluster, manually installing the components (not through the OpenShift AI operator) and trying to run the examples. This seems similar to what you mentioned. Will check it out again!

varshaprasad96, Jul 09 '24 12:07