codeflare-sdk
codeflare-sdk copied to clipboard
Refactor Ray Cluster/AppWrapper creation
Issue link
Closes: RHOAIENG-10385 and RHOAIENG-8846
What changes have been made
- Removed base_template.yaml
- Removed generate_yaml.py in favour of build_ray_cluster.py
- Added V1RayCluster API
- Removed unneeded labels
- Removed head_info
- Removed machine_types
- Changed app_wrapper_yaml to resource_yaml and removed app_wrapper_name
- Made
ray_versiona variable for potential future automation - Set the
imageconfig variable to default toquay.io/rhoai/ray:2.23.0-py39-cu121 - Moved is_openshift_cluster function to cluster.py
- Changed create_app_wrapper to
create_resource - Removed enableIngress catch logic
- Removed app_wrapper_name
- Removed _components_resources_down in favour of _delete_resources
- Refactored the
get_clustermethod to generate a new ClusterConfiguration with just thenameandnamespaceof the cluster and retrieved yaml. Addedis_appwrapperbool so that users can get AppWrappers/Ray Clusters - Added
_retrieved_clusterboolean forget_clustercommand to avoid generating a "false" Ray Cluster viaClusterConfiguration - Updated
Cluster Configurationdocumentation and added new doc for methods used when interacting with Ray Clusters/AppWrappers.
Verification steps
Setup
Notebook server ODH/RHOAI/Local
- Clone this repository with
git clone https://github.com/project-codeflare/codeflare-sdk.git - Checkout this PR's branch
- Run
poetry build- install if needed (pip install poetry) - Run
pip install --force-reinstall dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl - Restart your notebook kernel
Testing
All ClusterConfiguration parameters must be tested with the new cluster creation method.
Keep a special eye out for the following as they were the most complex to implement:
- worker_extended_resource_requests
- head_extended_resource_requests
- extended_resource_mapping
- The notebook annotations.
Automated Notebook testing should cover the functionality changed but I still suggest all parameters should be human verified.
Test the new and improved get_cluster() function.
NOTE: You can compare the original & retrieved clusters by setting write_to_file=True on ClusterConfiguration and get_cluster()
- Create a Ray Cluster
- Restart your notebook kernel
- Run
cluster = get_cluster(cluster_name=<name>, namespace=<namespace>, is_appwrapper=False, write_to_file=True) - Run through the basic
cluster.methods - Run
cluster.down()thencluster.up() - Ensure the cluster is created successfully
- Repeat steps for AppWrapper (Remember to enable AppWrappers in the CFO configmap!)
TODO
- Add
ImagePullSecrets# Done - Revise
V1RayCluster
Checks
- [x] I've made sure the tests are passing.
- Testing Strategy
- [x] Unit tests
- [x] Manual tests
- [ ] Testing is not required for this change