codeflare-sdk

SDK Info Exposure Rework (status/details, wait, error-handling, etc.)

Open Maxusmusti opened this issue 3 years ago • 4 comments

Merge cluster.status() and cluster.is_ready() into one function (likely still called status). It will tell the user exactly where in the setup process their cluster currently is (still in the AppWrapper stages or already in the Ray stages). A second function, cluster.details(), will then output the cluster information: specs, worker count, URI, active/inactive, etc. (what we currently see when calling cluster.status() on a fully set-up cluster).

Maxusmusti • Nov 21 '22 16:11

Also add a wait() function (likely a simple loop that polls the status() function mentioned above).
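
A polling loop of this kind could look roughly like the sketch below. This is not the SDK's implementation: `wait_ready`, `get_status`, `timeout` and `poll_interval` are hypothetical names, and `get_status` is assumed to be a stand-in for cluster.status() that returns a (phase, ready) pair.

```python
import time
from typing import Callable, Optional, Tuple


def wait_ready(
    get_status: Callable[[], Tuple[str, bool]],
    timeout: Optional[float] = None,
    poll_interval: float = 5.0,
) -> str:
    """Poll get_status() until it reports ready; return the final phase.

    get_status is a hypothetical stand-in for cluster.status() and is
    assumed to return a (phase, ready) pair. Raises TimeoutError if the
    cluster is still not ready once `timeout` seconds have elapsed.
    """
    # Use a monotonic clock so wall-clock adjustments cannot break the deadline.
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        phase, ready = get_status()
        if ready:
            return phase
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(
                f"cluster not ready after {timeout}s (last phase: {phase})"
            )
        time.sleep(poll_interval)
```

Passing the status function in as a callable keeps the loop independent of how status() is eventually implemented, and `timeout=None` waits indefinitely.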

Maxusmusti • Nov 22 '22 16:11

Change the status() return value from a bool to an info object.
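
One possible shape for such an info object is sketched below. The names `Phase` and `ClusterStatusInfo` and the set of phases are illustrative assumptions, not the SDK's actual types.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    # Hypothetical phases; the real SDK enum may use different names and values.
    QUEUED = "queued"      # still in the AppWrapper stages
    STARTING = "starting"  # Ray cluster is coming up
    READY = "ready"
    FAILED = "failed"


@dataclass(frozen=True)
class ClusterStatusInfo:
    """Hypothetical return type for status(): a phase plus an optional message."""

    phase: Phase
    message: str = ""

    @property
    def ready(self) -> bool:
        return self.phase is Phase.READY

    def __bool__(self) -> bool:
        # Lets code written against the old bool return (`if cluster.status():`)
        # keep working while exposing the richer phase information.
        return self.ready
```

Defining `__bool__` is one way to soften the change for existing callers; returning a plain (phase, ready) tuple would be an equally reasonable design.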

Maxusmusti • Nov 22 '22 16:11

The RayCluster is missing status.state; see https://github.com/ray-project/kuberay/issues/991

oc create -k "github.com/ray-project/kuberay/ray-operator/config/default?ref=v0.5.0&timeout=90s"

The status from a RayCluster shows:

status:
  availableWorkerReplicas: 2
  desiredWorkerReplicas: 1
  endpoints:
    client: "10001"
    dashboard: "8265"
    gcs: "6379"
  head:
    serviceIP: 172.21.234.58
  lastUpdateTime: "2023-06-05T20:01:31Z"
  maxWorkerReplicas: 1
  minWorkerReplicas: 1

This causes a problem for the codeflare API (https://github.com/project-codeflare/codeflare-sdk/blob/main/src/codeflare_sdk/cluster/cluster.py#L428), which looks for the state. Because the state is missing, the cluster stays reported as STARTING: (<CodeFlareClusterStatus.STARTING: 2>, False)
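
The effect can be illustrated with a small sketch. It is not the code at the link above: `ClusterPhase` and `phase_from_raycluster` are hypothetical, only STARTING = 2 is taken from the output shown, and the other enum values and the "ready"/"failed" state strings are assumptions.

```python
from enum import Enum
from typing import Any, Dict, Tuple


class ClusterPhase(Enum):
    # STARTING = 2 matches the output quoted in this comment;
    # the other members and values are assumptions for illustration.
    READY = 1
    STARTING = 2
    FAILED = 3
    UNKNOWN = 4


def phase_from_raycluster(resource: Dict[str, Any]) -> Tuple[ClusterPhase, bool]:
    """Map a RayCluster resource (as a dict) to a (phase, ready) pair."""
    state = resource.get("status", {}).get("state")
    if state is None:
        # No status.state at all: indistinguishable from a cluster still coming up.
        return ClusterPhase.STARTING, False
    if state == "ready":   # assumed state string
        return ClusterPhase.READY, True
    if state == "failed":  # assumed state string
        return ClusterPhase.FAILED, False
    return ClusterPhase.UNKNOWN, False


# Status block as posted above: every field is present except `state`.
status_without_state = {
    "availableWorkerReplicas": 2,
    "desiredWorkerReplicas": 1,
    "endpoints": {"client": "10001", "dashboard": "8265", "gcs": "6379"},
    "head": {"serviceIP": "172.21.234.58"},
    "lastUpdateTime": "2023-06-05T20:01:31Z",
    "maxWorkerReplicas": 1,
    "minWorkerReplicas": 1,
}

print(phase_from_raycluster({"status": status_without_state}))
```

With the status block above, the lookup finds no `state` key, so the sketch falls through to the STARTING branch however long the cluster has been running.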

thinkahead • Jun 05 '23 21:06

Please ignore my previous comment; it works with 0.5.0 (I was using 0.4.0 previously).

thinkahead • Jun 06 '23 21:06