
[Bug] GKE CSI Fuse Mounts prevent worker pod creation

Open jrosti opened this issue 1 year ago • 12 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

If one defines GKE CSI Fuse mounts on worker Pods, the init container never succeeds and eventually fails with:

Error: context deadline exceeded
  Warning  Failed     12s                  kubelet            Error: failed to reserve container name "wait-gcs-ready_search...7fb_0": name "wait-gcs-ready_search...7fb_0" is reserved for "df8...d3a"

GKE version: v1.26.3-gke.1000

The mount works on the head node.

I believe the root cause is that the gke-gcsfuse-sidecar container is required to mount the CSI fuse volume. It never starts because, as a regular container, it cannot run while the Pod is still in the PodInitializing state. The Pod can't finish initializing because wait-gcs-ready gets a clone of the worker's volumeMounts, including the fuse mount, and this combination leads to a deadlock. wait-gcs-ready eventually fails with CreateContainerError.
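For illustration, the injected init container on the worker ends up looking roughly like this (a sketch; the name matches the error above, but the image and command are approximations of KubeRay's defaults). Note the fuse mount cloned from the worker container:

initContainers:
- name: wait-gcs-ready
  image: rayproject/ray:2.4.0
  command: ['sh', '-c', 'until ray health-check --address $RAY_IP:$RAY_PORT > /dev/null 2>&1; do echo waiting for GCS; sleep 5; done']
  volumeMounts:
  - name: model-storage    # cloned from the worker; requires the gke-gcsfuse-sidecar,
    mountPath: /fuse       # which cannot start until all init containers have finished
  - name: ray-logs
    mountPath: /tmp/ray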

Reproduction script

Applying the following resource reproduces the issue, provided that the CSI driver, bucket, and service account are set up (instructions here: https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver):

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: fuserepro
spec:
  rayVersion: '2.4.0'
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
      block: 'true'
    template:
      metadata:
        labels: {}
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.4.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
            - mountPath: /tmp/ray
              name: ray-logs
          resources:
            limits:
              cpu: "2"
              memory: "4G"
            requests:
              cpu: "2"
              memory: "4G"
        volumes:
          - name: ray-logs
            emptyDir: {}
  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 1
    groupName: cpu1
    rayStartParams:
      block: 'true'
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
      spec:
        terminationGracePeriodSeconds: 60
        serviceAccountName: sa-for-bucket-access
        containers:
        - name: ray-worker
          image: rayproject/ray:2.4.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
            - name: model-storage
              mountPath: /fuse
              readOnly: false
            - mountPath: /tmp/ray
              name: ray-logs
          resources:
            limits:
              cpu: 1
              memory: "2G"
            requests:
              cpu: 1
              memory: "2G"
        initContainers:
        - name: init
          image: busybox:1.28
          command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for K8s Service $RAY_IP; sleep 2; done"]
        volumes:
          - name: model-storage
            csi:
              driver: gcsfuse.csi.storage.gke.io
              readOnly: false
              volumeAttributes:
                bucketName: gcs-bucket
          - name: ray-logs
            emptyDir: {}

Anything else

KubeRay justifies copying the volumes into the wait-gcs-ready init container as follows. I'm not sure what a clean solution would be for avoiding references to certain volumes during initialization:

		// This init container requires certain environment variables to establish a secure connection with the Ray head using TLS authentication.
		// Additionally, some of these environment variables may reference files stored in volumes, so we need to include both the `Env` and `VolumeMounts` fields here.
		// For more details, please refer to: https://docs.ray.io/en/latest/ray-core/configure.html#tls-authentication.
		Env:          podTemplate.Spec.Containers[rayContainerIndex].DeepCopy().Env,
		VolumeMounts: podTemplate.Spec.Containers[rayContainerIndex].DeepCopy().VolumeMounts,

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

jrosti avatar May 31 '23 07:05 jrosti

I've just hit a similar problem with the wait-gcs-ready container injection. Does this really have to be done in a separate container? There's already logic in BuildPod() that does different things depending on rayNodeType, so why not just set the command in there for a worker to be:

 until ray health-check --address ${RAY_IP}:${RAY_PORT} > /dev/null 2>&1; do echo wait for GCS to be ready; sleep 5; done && ray start ...

cread avatar Jun 01 '23 19:06 cread

You can check https://github.com/ray-project/kuberay/pull/1069 for a workaround. Basically, you can set ENABLE_INIT_CONTAINER_INJECTION to false to avoid the default init container and set your own init container instead.
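For reference, a minimal sketch of that workaround, assuming the flag is read as an environment variable on the kuberay-operator Deployment (per the PR) and reusing the health-check loop suggested above:

# On the kuberay-operator Deployment:
env:
- name: ENABLE_INIT_CONTAINER_INJECTION
  value: "false"

# In the worker Pod template: your own init container, deliberately without
# the fuse volumeMount, so nothing blocks on the gcsfuse sidecar:
initContainers:
- name: wait-gcs-ready
  image: rayproject/ray:2.4.0
  command: ['sh', '-c', 'until ray health-check --address $RAY_IP:$RAY_PORT > /dev/null 2>&1; do echo waiting for GCS; sleep 5; done']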

Yicheng-Lu-llll avatar Jun 02 '23 04:06 Yicheng-Lu-llll

Reopening this issue. I will check whether we should make GKE CSI Fuse work with the default KubeRay config, or whether updating the documentation is sufficient.

kevin85421 avatar Oct 06 '23 22:10 kevin85421

@kevin85421 I'm curious about the status of this issue. We hit the same issue and wonder if it can work with the default KubeRay config. Disabling auto init container injection would be quite a big breaking change on our platform. I wonder if we can approach it with a more elegant solution. Thank you.

daikeshi avatar Jan 11 '24 20:01 daikeshi

Is the primary reason for the volume mounting to read TLS certificates? Copying all volume mounts does seem unnecessary, but I'm not sure how we would check which ones are actually needed by the init container.

cc @msau42

andrewsykim avatar Jan 12 '24 15:01 andrewsykim

Disabling auto init container injection will be a quite big breaking change on our platform.

cc @daikeshi Would you mind sharing more details about this?

cc @andrewsykim is there any way to check whether the head service has more than 0 endpoints from within a worker Pod, without any RBAC? In KubeRay v1.1.0, the head Pod always has a readiness probe that checks the status of the GCS. Hence, if the head service has more than 0 endpoints (the head service should only have 0 or 1 endpoint), it means the GCS is ready. See #1674 for more details. If this is possible, the init container doesn't need to communicate with the head Pod.
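One possibility along those lines (a sketch, not verified): since the head Pod is only listed as a service endpoint once its readiness probe passes, a plain TCP connect to the head service from the worker fails while the service has 0 endpoints, and no RBAC is required:

initContainers:
- name: wait-gcs-ready
  image: busybox:1.28
  # 6379 is the GCS port from the manifest above; the connection only succeeds
  # once the head Pod is Ready and therefore registered as a service endpoint.
  command: ['sh', '-c', 'until nc -z $RAY_IP 6379; do echo waiting for head endpoint; sleep 2; done']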

kevin85421 avatar Jan 12 '24 19:01 kevin85421

cc @daikeshi Would you mind sharing more details about this?

@kevin85421 Yeah, it's specific to our setup. We have a Python SDK that interacts with the KubeRay Kubernetes API to create Ray clusters, so if we disable the auto init container injection on the server side, the existing SDK and users' YAML will no longer work. They would either need to use our updated SDK or update their YAML files for Ray cluster creation.

daikeshi avatar Jan 13 '24 03:01 daikeshi

The GKE GCS FUSE CSI team is working on adopting the Kubernetes native sidecar container feature. We are targeting mid-March to make the feature available in GKE 1.29 clusters.

songjiaxun avatar Feb 21 '24 17:02 songjiaxun

Let me clarify the previous comment.

The root cause of this issue is that the GCS FUSE CSI driver currently does not support volumes for init containers. This is because we run the GCSFuse binary in a sidecar container, and for now the sidecar container is a regular container.

After the ~~mid-March~~ (the new ETA is 3/29/2024) GKE release, we will start running the GCSFuse binary in a Kubernetes native sidecar container, which is itself an init container. This means that, with the new release, you can mount a GCSFuse volume in your init containers. This new feature will fix this issue and streamline GCS data usage on KubeRay.
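For context, a Kubernetes native sidecar is declared as an init container with restartPolicy: Always; a rough sketch of what the injected sidecar could look like (container name and image are illustrative):

initContainers:
- name: gke-gcsfuse-sidecar
  image: gcs-fuse-csi-driver-sidecar-mounter   # illustrative
  restartPolicy: Always   # native sidecar: starts before the remaining init
                          # containers and keeps running alongside them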

songjiaxun avatar Mar 08 '24 18:03 songjiaxun

The new GKE version rollout has completed. Starting from GKE 1.29.3-gke.1093000, the CSI driver injects the GCSFuse sidecar container as a native init container, which also supports mounting GCSFuse volumes in other init containers.

To try out the new feature, please upgrade your GKE cluster to 1.29.3-gke.1093000 or later, make sure ALL your nodes are also upgraded to GKE version 1.29 or later, then re-deploy your workloads.
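With the upgraded driver, an init container can mount the fuse volume directly; a sketch against the reproduction manifest above:

initContainers:
- name: init
  image: rayproject/ray:2.4.0
  volumeMounts:
  - name: model-storage   # now works: the gcsfuse sidecar is itself a native
    mountPath: /fuse      # init container and starts before this one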

songjiaxun avatar Apr 07 '24 05:04 songjiaxun

@songjiaxun thanks for the update! I'll try it out.

daikeshi avatar Apr 16 '24 14:04 daikeshi