[Feature] Ray Serve CR and Controller

Open simon-mo opened this issue 2 years ago • 1 comments

Search before asking

  • [X] I had searched in the issues and found no similar feature requirement.

Description

We would like to contribute a controller, embedded in KubeRay, to operate a Ray Serve application on top of a KubeRay cluster.

apiVersion: serve.ray.io/v1
kind: ServingCluster
metadata:
  name: .
  # status is populated by the operator; users can run `kubectl get serve_deployments name -o json | jq .metadata.status` to read the field.
  status: UPDATING|HEALTHY|UNHEALTHY
spec:
  healthCheckConfig:  # optional
    health_period_s: 5s
    consecutive_failures_threshold: 3

  serveConfig:
    - deploymentClass: .
      numReplicas: 2
      rayActorOptions: .

  rayClusterConfig:
    apiVersion: cluster.ray.io/v1
    kind: RayCluster
    metadata:
      generateName: .
    spec:
      maxWorkers: 2
      podTypes:
        - name: head
          rayResources: .
          podConfig:
            apiVersion: v1
            kind: Pod
            metadata:
              generateName: .
            spec:
              containers:
                - name: ray-node
                  image: my_registry/container:v1
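
As a rough illustration of how `consecutive_failures_threshold` could drive the `status` field, here is a minimal Python sketch. The `HealthTracker` class and its `record` method are hypothetical names, and the semantics (flip to `UNHEALTHY` after N consecutive failed probes, reset on any success) are an assumption, since the proposal does not spell them out.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Hypothetical helper: derives a status from a stream of probe results."""

    consecutive_failures_threshold: int = 3  # mirrors the value in the spec above
    failures: int = 0
    status: str = "UPDATING"  # assumed initial state before any probe succeeds

    def record(self, probe_ok: bool) -> str:
        """Record one health probe result and return the resulting status."""
        if probe_ok:
            # Any successful probe resets the failure streak.
            self.failures = 0
            self.status = "HEALTHY"
        else:
            self.failures += 1
            # Only a full streak of failures marks the application unhealthy.
            if self.failures >= self.consecutive_failures_threshold:
                self.status = "UNHEALTHY"
        return self.status
```

With the threshold of 3 from the example spec, two failed probes in a row leave the status unchanged, and the third one switches it to `UNHEALTHY`.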

This operator performs health checks, handles the initial deployment and redeployment of the Serve app on the KubeRay cluster, and rotates the cluster if the Serve application fails. The CR will expose the health-check status of the Serve application.
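
The responsibilities listed above could be sketched as a single reconcile decision. This is a minimal Python illustration, not the controller's actual logic: `Action`, `Observed`, and `next_action` are hypothetical names, and the priority order (cluster first, then deployment, then health, then config changes) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CREATE_CLUSTER = "create_cluster"
    DEPLOY = "deploy"                  # initial deployment of the Serve app
    REDEPLOY = "redeploy"              # serveConfig changed
    ROTATE_CLUSTER = "rotate_cluster"  # Serve app failed its health checks
    NONE = "none"


@dataclass(frozen=True)
class Observed:
    """Hypothetical snapshot of what the operator sees in one reconcile pass."""

    cluster_ready: bool
    app_deployed: bool
    app_unhealthy: bool
    config_changed: bool


def next_action(o: Observed) -> Action:
    """Pick the single next step that moves the system toward the desired state."""
    if not o.cluster_ready:
        return Action.CREATE_CLUSTER
    if not o.app_deployed:
        return Action.DEPLOY
    if o.app_unhealthy:
        # Assumed priority: a failing app triggers rotation before any redeploy.
        return Action.ROTATE_CLUSTER
    if o.config_changed:
        return Action.REDEPLOY
    return Action.NONE
```

Returning one action per pass keeps each reconcile step small and idempotent, which is the usual shape of a Kubernetes controller loop.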

You can find more information in this design doc.

Conceptually, this is similar to SparkJob and FlinkJob in their respective operators. It is a high-level concept built on top of existing CRs.

Compared to the Ray Jobs controller/CR design, the service CR is designed to be long-running and should outlive cluster failures. However, both workloads use Ray's REST API endpoint to perform operations on the Ray cluster.

Use case

  • Deploy Ray Serve applications reliably on a K8s cluster.
  • Manage Serve applications in a cloud-native way.
  • Provide an entrypoint to highly available applications on Ray.

Related issues

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

simon-mo, Mar 26 '22 04:03

Hi. We have finalized some of the discussion about the design of the new k8s operator for Serve Deployment and RayCluster management. Here is our design doc. We would like to hear feedback from the committee to reach alignment. Also, @simon-mo: one thing we want to discuss is how to add this new operator and what the repo package structure should look like.

brucez-anyscale, May 16 '22 22:05

This has been done.

DmitriGekhtman, Jan 14 '23 01:01