[Feature] Support Istio
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
Support Istio with KubeRay
Use case
https://github.com/kubeflow/manifests/pull/2383
Related issues
#948
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
@kevin85421 are there any internal plans to support this?
cc @Yicheng-Lu-llll
@kevin85421 if this is supported, would it be possible to submit jobs using an external DNS? I have tried but had no luck.
I have updated my query here : https://discuss.ray.io/t/need-assistance-to-submit-job-using-external-dns/13354
Hi @kevin85421,
After digging into this for a couple of days, I have managed to make Ray work with Istio on three different levels with some modifications in KubeRay.
Level 1: Plain TCP
This is the basic level of Istio integration that doesn't do any traffic encryption and should work out of the box with the current KubeRay.
One thing to note is that, with the latest Istio 1.21.0, users MUST apply the following Istio config to let the Istio sidecar proxy the plain gRPC traffic correctly, or the Ray worker node won't be able to start.
meshConfig:
  defaultConfig:
    runtimeValues:
      envoy.reloadable_features.sanitize_te: "false"
There is an open issue for this: https://github.com/istio/istio/issues/49685
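For reference, here is a minimal sketch of how that setting could be applied through an IstioOperator resource (assuming an istioctl/Operator-based installation; the resource name is arbitrary, and the same meshConfig can instead be merged into an existing install spec or the istio ConfigMap):

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  # Hypothetical name; merge into your existing IstioOperator spec if you already have one.
  name: sanitize-te-workaround
  namespace: istio-system
spec:
  meshConfig:
    defaultConfig:
      runtimeValues:
        # Disable Envoy's TE-header sanitization, which otherwise breaks Ray's
        # plain gRPC traffic (see the Istio issue linked above).
        envoy.reloadable_features.sanitize_te: "false"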
Level 2: mTLS
This is the second level of integration, where Istio must learn L4 information about Ray in order to encrypt traffic. That is, when a TCP packet enters or leaves an Istio sidecar, Istio must be able to determine where it comes from or goes to based on its source or destination IP+port tuple. In Ray, a pod connects to another pod directly, without the virtual IP of a normal Kubernetes service. So we must apply a headless service to all Ray pods so that Istio learns this L4 information correctly, for example:
apiVersion: v1
kind: Service
metadata:
  labels:
    ray.io/headless-worker-svc: raycluster-mini
  name: raycluster-mini-headless-svc
  namespace: default
spec:
  clusterIP: None
  selector:
    ray.io/cluster: raycluster-mini
  ports:
  - name: node-manager-port
    port: 6380
  - name: object-manager-port
    port: 6381
  - name: runtime-env-agent-port
    port: 6382
  - name: dashboard-agent-grpc-port
    port: 6383
  - name: dashboard-agent-listen-port
    port: 52365
  - name: metrics-export-port
    port: 8080
  - name: p10002
    port: 10002
  - name: p10003
    port: 10003
  - name: p10004
    port: 10004
  - name: p10005
    port: 10005
  ...
All the ports listed in https://docs.ray.io/en/latest/ray-core/configure.html#all-nodes, including worker ports, MUST be explicitly configured in the rayStartParams and carried to the headless service.
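As a rough sketch (not part of the original example), one way to keep that list manageable is to pin the worker port range in rayStartParams with Ray's min-worker-port/max-worker-port options, so only a small, known set of worker ports has to be mirrored into the headless service; the exact values below are an assumption chosen to match the p10002-p10005 entries above:

rayStartParams:
  node-manager-port: '6380'
  object-manager-port: '6381'
  runtime-env-agent-port: '6382'
  dashboard-agent-grpc-port: '6383'
  dashboard-agent-listen-port: '52365'
  metrics-export-port: '8080'
  # Constrain the worker port range (default 10002-19999) so each worker port
  # can be listed explicitly (p10002, p10003, ...) in the headless service.
  min-worker-port: '10002'
  max-worker-port: '10005'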
Furthermore, if Istio operates in STRICT mode, two more changes are needed on the KubeRay side:
- Disable the wait-gcs-ready init container by setting the env ENABLE_INIT_CONTAINER_INJECTION=false. This init container cannot connect to the Ray GCS in Istio STRICT mode and will never finish, because it starts before the Istio sidecar.
- The KubeRay controller talks to the underlying RayCluster of a RayService and RayJob directly. In Istio STRICT mode, the controller should also have an Istio sidecar deployed, or it should talk to the RayCluster through the kubeapi proxy.
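For context, STRICT mode here refers to an Istio PeerAuthentication policy along these lines (a mesh-wide sketch placed in the Istio root namespace, typically istio-system):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    # Reject any plaintext traffic; all workloads must talk over Istio mTLS.
    mode: STRICT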
Level 3: L7 Visibility
This is the strictest level of integration, where Istio knows the L7 details of Ray's internal traffic. We not only need to set appProtocol correctly in the above headless service, but also need to let Ray nodes use FQDN hostnames to communicate with each other internally, because Istio uses the Host/Authority header for L7 routing. That is, setting --node-ip-address on every Ray node is needed.
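For illustration, the port entries of the headless service above would then carry appProtocol hints roughly like this (the grpc/http split mirrors what the patch below hard-codes and is otherwise an assumption):

ports:
- name: node-manager-port
  port: 6380
  appProtocol: grpc
- name: object-manager-port
  port: 6381
  appProtocol: grpc
- name: dashboard-agent-grpc-port
  port: 6383
  appProtocol: grpc
- name: dashboard-agent-listen-port
  port: 52365
  appProtocol: http
- name: metrics-export-port
  port: 8080
  appProtocol: http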
Here is a quick-and-dirty hack to KubeRay that enables mTLS + L7 visibility for a RayCluster named raycluster-mini:
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-mini
spec:
  rayVersion: '2.9.0'
  headGroupSpec:
    rayStartParams:
      num-cpus: '1'
      node-manager-port: '6380'
      object-manager-port: '6381'
      runtime-env-agent-port: '6382'
      dashboard-agent-grpc-port: '6383'
      dashboard-grpc-port: '6384'
    #pod template
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
          ports:
          - containerPort: 6379
            name: redis
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 1
    groupName: small-group
    rayStartParams:
      num-cpus: '1'
      node-manager-port: '6380'
      object-manager-port: '6381'
      runtime-env-agent-port: '6382'
      dashboard-agent-grpc-port: '6383'
    #pod template
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
The hack:
diff --git a/ray-operator/controllers/ray/common/pod.go b/ray-operator/controllers/ray/common/pod.go
index 0182ead..b79f7e5 100644
--- a/ray-operator/controllers/ray/common/pod.go
+++ b/ray-operator/controllers/ray/common/pod.go
@@ -339,6 +339,7 @@ func BuildPod(ctx context.Context, podTemplateSpec corev1.PodTemplateSpec, rayNo
ulimitCmd := "ulimit -n 65536"
// Generate the `ray start` command.
rayStartCmd := generateRayStartCommand(ctx, rayNodeType, rayStartParams, pod.Spec.Containers[utils.RayContainerIndex].Resources)
+ rayStartCmd += ` --node-ip-address $(echo "$POD_IP" | sed 's/\./-/g').raycluster-mini-headless-svc.default.svc.cluster.local`
// Check if overwrites the generated container command or not.
isOverwriteRayContainerCmd := false
@@ -580,7 +581,14 @@ func setContainerEnvVars(pod *corev1.Pod, rayNodeType rayv1.RayNodeType, rayStar
},
},
}
- container.Env = append(container.Env, rayCloudInstanceID)
+ container.Env = append(container.Env, rayCloudInstanceID, corev1.EnvVar{
+ Name: "POD_IP",
+ ValueFrom: &corev1.EnvVarSource{
+ FieldRef: &corev1.ObjectFieldSelector{
+ FieldPath: "status.podIP",
+ },
+ },
+ })
// RAY_NODE_TYPE_NAME is used by Ray Autoscaler V2 (alpha). See https://github.com/ray-project/kuberay/issues/1965 for more details.
nodeGroupNameEnv := corev1.EnvVar{
diff --git a/ray-operator/controllers/ray/common/service.go b/ray-operator/controllers/ray/common/service.go
index 7ffb8f5..becc958 100644
--- a/ray-operator/controllers/ray/common/service.go
+++ b/ray-operator/controllers/ray/common/service.go
@@ -3,7 +3,9 @@ package common
import (
"context"
"fmt"
+ "k8s.io/utils/pointer"
"sort"
+ "strings"
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
@@ -68,6 +70,15 @@ func BuildServiceForHeadPod(ctx context.Context, cluster rayv1.RayCluster, label
ports := []corev1.ServicePort{}
for name, port := range ports_int {
svcPort := corev1.ServicePort{Name: name, Port: port, AppProtocol: &defaultAppProtocol}
+ switch port {
+ case 6379, 10001:
+ svcPort.AppProtocol = pointer.String("grpc")
+ case 8265:
+ svcPort.AppProtocol = pointer.String("http")
+ }
+ if strings.Contains(name, "grpc") {
+ svcPort.AppProtocol = pointer.String("grpc")
+ }
ports = append(ports, svcPort)
}
if cluster.Spec.HeadGroupSpec.HeadService != nil {
@@ -285,8 +296,7 @@ func BuildHeadlessServiceForRayCluster(rayCluster rayv1.RayCluster) (*corev1.Ser
utils.RayClusterHeadlessServiceLabelKey: rayCluster.Name,
}
selectorLabels := map[string]string{
- utils.RayClusterLabelKey: rayCluster.Name,
- utils.RayNodeTypeLabelKey: string(rayv1.WorkerNode),
+ utils.RayClusterLabelKey: rayCluster.Name, // Apply headless service to all Ray nodes.
}
headlessService := &corev1.Service{
@@ -296,11 +306,51 @@ func BuildHeadlessServiceForRayCluster(rayCluster rayv1.RayCluster) (*corev1.Ser
Labels: labels,
},
Spec: corev1.ServiceSpec{
- ClusterIP: "None",
- Selector: selectorLabels,
- Type: corev1.ServiceTypeClusterIP,
+ ClusterIP: "None",
+ Selector: selectorLabels,
+ Type: corev1.ServiceTypeClusterIP,
+ PublishNotReadyAddresses: true, // Ray head will not be able to start without this.
+ Ports: []corev1.ServicePort{
+ { // All the following port definitions, including worker ports, should be parsed from the rayStartParams
+ Name: "node-manager-port",
+ AppProtocol: pointer.String("grpc"),
+ Port: 6380,
+ },
+ {
+ Name: "object-manager-port",
+ AppProtocol: pointer.String("grpc"),
+ Port: 6381,
+ },
+ {
+ Name: "runtime-env-agent-port",
+ AppProtocol: pointer.String("grpc"),
+ Port: 6382,
+ },
+ {
+ Name: "dashboard-agent-grpc-port",
+ AppProtocol: pointer.String("grpc"),
+ Port: 6383,
+ },
+ {
+ Name: "dashboard-agent-listen-port",
+ AppProtocol: pointer.String("http"),
+ Port: 52365,
+ },
+ {
+ Name: "metrics-export-port",
+ AppProtocol: pointer.String("http"),
+ Port: 8080,
+ },
+ },
},
}
+ for i := int32(10002); i < 10100; i++ {
+ headlessService.Spec.Ports = append(headlessService.Spec.Ports, corev1.ServicePort{
+ Name: fmt.Sprintf("p%d", i),
+ AppProtocol: pointer.String("grpc"),
+ Port: i,
+ })
+ }
return headlessService, nil
}
diff --git a/ray-operator/controllers/ray/raycluster_controller.go b/ray-operator/controllers/ray/raycluster_controller.go
index 63b584c..3d51cec 100644
--- a/ray-operator/controllers/ray/raycluster_controller.go
+++ b/ray-operator/controllers/ray/raycluster_controller.go
@@ -593,6 +593,7 @@ func (r *RayClusterReconciler) reconcileHeadlessService(ctx context.Context, ins
}
}
+ isMultiHost = true // Always create the headless service
if isMultiHost {
services := corev1.ServiceList{}
options := common.RayClusterHeadlessServiceListOptions(instance)
diff --git a/ray-operator/controllers/ray/utils/constant.go b/ray-operator/controllers/ray/utils/constant.go
index 9bad224..0302324 100644
--- a/ray-operator/controllers/ray/utils/constant.go
+++ b/ray-operator/controllers/ray/utils/constant.go
@@ -85,8 +85,8 @@ const (
ComponentName = "kuberay-operator"
// The default suffix for Headless Service for multi-host worker groups.
- // The full name will be of the form "${RayCluster_Name}-headless-worker-svc".
- HeadlessServiceSuffix = "headless-worker-svc"
+ // The full name will be of the form "${RayCluster_Name}-headless-svc".
+ HeadlessServiceSuffix = "headless-svc"
// Use as container env variable
RAY_CLUSTER_NAME = "RAY_CLUSTER_NAME"
Summary
The above hacky patch made a couple of changes to KubeRay to enable mTLS + L7 visibility with Istio:
- Extend the current headless service for TPU multi-host worker groups to all Ray nodes. This generates an FQDN for each Ray node.
- Hard-code all port definitions in the headless service for this demo. This fulfills the criteria for auto mTLS.
- Specify --node-ip-address FQDN on all Ray nodes. This, along with the correct appProtocol, fulfills the criteria for L7 visibility.
Related issues: https://github.com/ray-project/kuberay/issues/948 https://github.com/ray-project/kuberay/issues/1025 https://github.com/ray-project/kuberay/issues/1946
Discussions
Although the above patch doesn't change many lines, it raises some critical issues:
- All the ports must be explicitly stated in rayStartParams, but the default worker port range is too large. I tried to state all the default 10002~19999 ports in my local Kind cluster and they caused all pods to be killed by OOM. We should probably document this memory issue and recommend that users use a smaller worker port range if they want to enable auto mTLS.
- The --node-ip-address FQDN option may bring a compatibility issue: the old kube-dns format uses <pod-ip>.<headless-svc>.<ns>.cluster.local, but the new spec uses <pod-hostname>.<headless-svc>.<ns>.cluster.local. CoreDNS keeps the former format for backward compatibility, but other implementations may only accept the latter.
Thank you @rueian, I hope that https://github.com/istio/istio/issues/49685 gets fixed soon then.
Relevant KEP in Kubernetes: https://github.com/kubernetes/enhancements/pull/2611
There were some previous discussions around supporting port ranges as well, but I can't seem to find that discussion.
https://github.com/ray-project/ray/pull/44801 Ruei-An has already added a doc for Istio. Closing this issue.
@rueian Do these changes need a specific version of kube-dns?
From my previous testing of KubeRay and Istio, it's not supported with kube-dns because kube-dns does not support Pod FQDN resolution for headless services. This is, however, supported by CoreDNS and most other DNS providers. @rueian can correct me if I'm wrong.
@andrewsykim Is it because the FQDN format in kube-dns supports only the hostname and not the pod IP?
Yeah, it doesn't support the <POD IP ADDRESS>.raycluster-istio-headless-svc.default.svc.cluster.local that is referenced in https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/istio.html.
I haven't tested if there are other viable workarounds
Hi @andrewsykim and @t-indumathy,
I found a note regarding hostname A records for Pods here: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields. In short, when a Pod sets spec.hostname and spec.subdomain and a headless Service with the same name as the subdomain exists in the namespace, the cluster DNS serves an A/AAAA record for the Pod's fully qualified hostname.
So the current workaround, if your kube-dns doesn't support the old Pod IP FQDN, is setting the hostname and subdomain fields on Pods manually to generate hostname A records.
This workaround is quite awkward since we can have only ONE replica in a worker group in order to set Pod hostnames manually. Here is a modified example from the KubeRay Istio tutorial that applies the workaround in the headGroupSpec and the workerGroupSpecs.
kubectl apply -f - <<EOF
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-istio
spec:
  rayVersion: '2.35.0'
  headGroupSpec:
    rayStartParams:
      num-cpus: '1'
      node-manager-port: '6380'
      object-manager-port: '6381'
      runtime-env-agent-port: '6382'
      dashboard-agent-grpc-port: '6383'
      dashboard-agent-listen-port: '52365'
      metrics-export-port: '8080'
      max-worker-port: '10012'
      node-ip-address: $(hostname --fqdn)
    template:
      spec:
        hostname: head
        subdomain: raycluster-istio-headless-svc
        containers:
        - name: ray-head
          image: rayproject/ray:2.35.0
  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 1
    groupName: small-group
    rayStartParams:
      num-cpus: '1'
      node-manager-port: '6380'
      object-manager-port: '6381'
      runtime-env-agent-port: '6382'
      dashboard-agent-grpc-port: '6383'
      dashboard-agent-listen-port: '52365'
      metrics-export-port: '8080'
      max-worker-port: '10012'
      node-ip-address: $(hostname --fqdn)
    template:
      spec:
        hostname: worker-1
        subdomain: raycluster-istio-headless-svc
        containers:
        - name: ray-worker
          image: rayproject/ray:2.35.0
EOF
Thanks @rueian ! Will test this.