codeflare-sdk
SDK Info Exposure Rework (status/details, wait, error-handling, etc.)
Merge cluster.status() and cluster.is_ready() into one function (likely still called status). It will tell the user exactly where their cluster currently is in the setup process (still in the AppWrapper stages, or already in the Ray stages). A second function, cluster.details(), will then output the cluster information: all of the specs, worker count, URI, active/inactive, etc. (what we currently see when calling cluster.status() on a fully set-up cluster).
Also add a wait() function (likely a simple loop checking using the above mentioned status() function)
Change status() return from bool to info object
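A minimal sketch of how the wait() loop and the info object could fit together. Every name here (Phase, StatusInfo, the get_status callable, the timeout and interval parameters) is hypothetical and chosen for illustration; none of it is the SDK's actual API.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Phase(Enum):
    """Hypothetical setup phases, split into AppWrapper and Ray stages."""
    APPWRAPPER_PENDING = "appwrapper_pending"
    RAY_STARTING = "ray_starting"
    READY = "ready"
    FAILED = "failed"


@dataclass
class StatusInfo:
    """Hypothetical info object that status() would return instead of a bool."""
    phase: Phase
    ready: bool
    message: str = ""


def wait(
    get_status: Callable[[], StatusInfo],
    timeout: Optional[float] = None,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> StatusInfo:
    """Poll get_status() until the cluster is ready, fails, or times out."""
    start = time.monotonic()
    while True:
        info = get_status()
        if info.ready:
            return info
        if info.phase is Phase.FAILED:
            raise RuntimeError(f"cluster setup failed: {info.message}")
        if timeout is not None and time.monotonic() - start >= timeout:
            raise TimeoutError(f"cluster not ready, last phase: {info.phase.value}")
        sleep(interval)
```

In this sketch a caller would pass the cluster's status method as get_status, for example wait(cluster.status, timeout=300), assuming status() returns such an info object. Raising on timeout rather than returning False keeps the "not ready" and "ready" paths distinct for error handling.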
The RayCluster resource is missing the status.state field: https://github.com/ray-project/kuberay/issues/991
oc create -k "github.com/ray-project/kuberay/ray-operator/config/default?ref=v0.5.0&timeout=90s"
The status from a RayCluster shows:
status:
  availableWorkerReplicas: 2
  desiredWorkerReplicas: 1
  endpoints:
    client: "10001"
    dashboard: "8265"
    gcs: "6379"
  head:
    serviceIP: 172.21.234.58
  lastUpdateTime: "2023-06-05T20:01:31Z"
  maxWorkerReplicas: 1
  minWorkerReplicas: 1
This causes a problem for the codeflare API at https://github.com/project-codeflare/codeflare-sdk/blob/main/src/codeflare_sdk/cluster/cluster.py#L428, which looks for the state field. Because the field is absent, the cluster stays reported as STARTING: (<CodeFlareClusterStatus.STARTING: 2>, False)
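To illustrate the failure mode, here is a small sketch of a status mapping that treats a missing state field as its own case instead of falling through to STARTING. The SketchStatus enum and map_ray_status function are hypothetical and are not the code at the linked line; the state strings "ready" and "failed" are assumptions about what the operator reports.

```python
from enum import Enum
from typing import Any, Dict, Tuple


class SketchStatus(Enum):
    """Hypothetical stand-in for the SDK's cluster status enum."""
    READY = 1
    STARTING = 2
    FAILED = 3
    UNKNOWN = 4


def map_ray_status(status: Dict[str, Any]) -> Tuple[SketchStatus, bool]:
    """Map a RayCluster status dict to (status, ready)."""
    state = status.get("state")
    if state is None:
        # No state field at all: report UNKNOWN rather than STARTING forever.
        return SketchStatus.UNKNOWN, False
    state = str(state).lower()
    if state == "ready":  # assumed value
        return SketchStatus.READY, True
    if state == "failed":  # assumed value
        return SketchStatus.FAILED, False
    return SketchStatus.STARTING, False


# The status block from this comment, which has no "state" key.
sample = {
    "availableWorkerReplicas": 2,
    "desiredWorkerReplicas": 1,
    "endpoints": {"client": "10001", "dashboard": "8265", "gcs": "6379"},
    "head": {"serviceIP": "172.21.234.58"},
    "lastUpdateTime": "2023-06-05T20:01:31Z",
    "maxWorkerReplicas": 1,
    "minWorkerReplicas": 1,
}
```

With this mapping, map_ray_status(sample) yields the UNKNOWN case, which makes a missing field visible to the user instead of looking like a cluster that never finishes starting.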
Please ignore my previous comment; it works with v0.5.0 (I was previously using v0.4.0).