codeflare-sdk icon indicating copy to clipboard operation
codeflare-sdk copied to clipboard

Refactor Ray Cluster/AppWrapper creation

Open Bobbins228 opened this issue 1 year ago • 2 comments

Issue link

Closes: RHOAIENG-10385 and RHOAIENG-8846

What changes have been made

  • Removed base_template.yaml
  • Removed generate_yaml.py in favour of build_ray_cluster.py
  • Added V1RayCluster API
  • Removed unneeded labels
  • Removed head_info
  • Removed machine_types
  • Changed app_wrapper_yaml to resource_yaml and removed app_wrapper_name
  • Made ray_version a variable for potential future automation
  • Set the image config variable to default to quay.io/rhoai/ray:2.23.0-py39-cu121
  • Moved is_openshift_cluster function to cluster.py
  • Changed create_app_wrapper to create_resource
  • Removed enableIngress catch logic
  • Removed app_wrapper_name
  • Removed _components_resources_down in favour of _delete_resources
  • Refactored the get_cluster method to generate a new ClusterConfiguration with just the name and namespace of the cluster and retrieved yaml. Added is_appwrapper bool so that users can get AppWrappers/Ray Clusters
  • Added _retrieved_cluster boolean for get_cluster command to avoid generating a "false" Ray Cluster via ClusterConfiguration
  • Updated Cluster Configuration documentation and added new doc for methods used when interacting with Ray Clusters/AppWrappers.

Verification steps

Setup

Notebook server ODH/RHOAI/Local

  • Clone this repository with git clone https://github.com/project-codeflare/codeflare-sdk.git
  • Checkout this PR's branch
  • Run poetry build - install if needed (pip install poetry)
  • Run pip install --force-reinstall dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl
  • Restart your notebook kernel

Testing

All ClusterConfiguration parameters must be tested with the new cluster creation method.

Keep a special eye out for the following as they were the most complex to implement:

  • worker_extended_resource_requests
  • head_extended_resource_requests
  • extended_resource_mapping
  • The notebook annotations.

Automated Notebook testing should cover the functionality changed but I still suggest all parameters should be human verified.

Test the new and improved get_cluster() function.

NOTE: You can compare the original & retrieved clusters by setting write_to_file=True on ClusterConfiguration and get_cluster()

  • Create a Ray Cluster
  • Restart your notebook kernel
  • Run cluster = get_cluster(cluster_name=<name>, namespace=<namespace>, is_appwrapper=False, write_to_file=True)
  • Run through the basic cluster. methods
  • Run cluster.down() then cluster.up()
  • Ensure the cluster is created successfully
  • Repeat steps for AppWrapper (Remember to enable AppWrappers in the CFO configmap!)

TODO

  • Add ImagePullSecrets # Done
  • Revise V1RayCluster

Checks

  • [x] I've made sure the tests are passing.
  • Testing Strategy
    • [x] Unit tests
    • [x] Manual tests
    • [ ] Testing is not required for this change

Bobbins228 avatar Aug 27 '24 11:08 Bobbins228