[Bug] GKE CSI Fuse Mounts prevent worker pod creation
Search before asking
- [X] I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
If one defines GKE CSI Fuse mounts on worker Pods, the init container never succeeds and eventually fails with:
Error: context deadline exceeded
Warning Failed 12s kubelet Error: failed to reserve container name "wait-gcs-ready_search...7fb_0": name "wait-gcs-ready_search...7fb_0" is reserved for "df8...d3a"
GKE version: v1.26.3-gke.1000
The mount works on the head node.
I believe the root cause is that the gke-gcsfuse-sidecar container, which is needed to mount the CSI fuse volume, never starts because it waits for the Pod to leave the PodInitializing state. The Pod cannot finish initializing because wait-gcs-ready receives a clone of the worker's volumeMounts, including the fuse mount, and that combination leads to a deadlock. wait-gcs-ready eventually fails with CreateContainerError.
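A rough sketch of the resulting worker Pod, using only what the kubelet error and the annotation above imply (container names come from the error and the CSI driver; the images and field ordering are illustrative, not the operator's literal output):

apiVersion: v1
kind: Pod
metadata:
  name: fuserepro-worker-sketch
spec:
  initContainers:
  - name: wait-gcs-ready          # injected by KubeRay; inherits the worker's volumeMounts
    image: rayproject/ray:2.4.0
    volumeMounts:
    - name: model-storage         # cloned fuse mount -- nothing is serving it yet
      mountPath: /fuse
    - name: ray-logs
      mountPath: /tmp/ray
  containers:
  - name: gke-gcsfuse-sidecar     # injected by the CSI webhook as a regular container,
    image: gcsfuse-sidecar-image  # so it cannot start until every init container succeeds
  - name: ray-worker
    image: rayproject/ray:2.4.0
    volumeMounts:
    - name: model-storage
      mountPath: /fuse
    - name: ray-logs
      mountPath: /tmp/ray
  volumes:
  - name: model-storage
    csi:
      driver: gcsfuse.csi.storage.gke.io
      volumeAttributes:
        bucketName: gcs-bucket
  - name: ray-logs
    emptyDir: {}

wait-gcs-ready needs model-storage, model-storage needs gke-gcsfuse-sidecar, and gke-gcsfuse-sidecar needs every init container to finish first, hence the deadlock.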
Reproduction script
Applying the following resource reproduces the issue, provided that the CSI driver, bucket, and service account are set up (instructions here: https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/cloud-storage-fuse-csi-driver)
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: fuserepro
spec:
  rayVersion: '2.4.0'
  headGroupSpec:
    serviceType: ClusterIP
    rayStartParams:
      dashboard-host: '0.0.0.0'
      block: 'true'
    template:
      metadata:
        labels: {}
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.4.0
          ports:
          - containerPort: 6379
            name: gcs
          - containerPort: 8265
            name: dashboard
          - containerPort: 10001
            name: client
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
          - mountPath: /tmp/ray
            name: ray-logs
          resources:
            limits:
              cpu: "2"
              memory: "4G"
            requests:
              cpu: "2"
              memory: "4G"
        volumes:
        - name: ray-logs
          emptyDir: {}
  workerGroupSpecs:
  - replicas: 1
    minReplicas: 1
    maxReplicas: 1
    groupName: cpu1
    rayStartParams:
      block: 'true'
    template:
      metadata:
        annotations:
          gke-gcsfuse/volumes: "true"
      spec:
        terminationGracePeriodSeconds: 60
        serviceAccountName: sa-for-bucket-access
        containers:
        - name: ray-worker
          image: rayproject/ray:2.4.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          volumeMounts:
          - name: model-storage
            mountPath: /fuse
            readOnly: false
          - mountPath: /tmp/ray
            name: ray-logs
          resources:
            limits:
              cpu: 1
              memory: "2G"
            requests:
              cpu: 1
              memory: "2G"
        initContainers:
        - name: init
          image: busybox:1.28
          command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for K8s Service $RAY_IP; sleep 2; done"]
        volumes:
        - name: model-storage
          csi:
            driver: gcsfuse.csi.storage.gke.io
            readOnly: false
            volumeAttributes:
              bucketName: gcs-bucket
        - name: ray-logs
          emptyDir: {}
Anything else
KubeRay justifies copying the volumes into the wait-gcs-ready init container as follows. I'm not sure what a clean solution would be to avoid referencing certain volumes during initialization.
// This init container requires certain environment variables to establish a secure connection with the Ray head using TLS authentication.
// Additionally, some of these environment variables may reference files stored in volumes, so we need to include both the `Env` and `VolumeMounts` fields here.
// For more details, please refer to: https://docs.ray.io/en/latest/ray-core/configure.html#tls-authentication.
Env: podTemplate.Spec.Containers[rayContainerIndex].DeepCopy().Env,
VolumeMounts: podTemplate.Spec.Containers[rayContainerIndex].DeepCopy().VolumeMounts,
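For context, the copy matters because a TLS-enabled cluster typically sets Ray's TLS environment variables on the Ray container, and those variables point at certificate files that only exist inside mounted volumes; the injected init container needs the same env and mounts to reach the head over TLS. A hedged illustration (the variable names are from the Ray TLS docs linked above; the secret name and paths are made up):

containers:
- name: ray-worker
  image: rayproject/ray:2.4.0
  env:
  - name: RAY_USE_TLS
    value: "1"
  - name: RAY_TLS_SERVER_CERT
    value: /etc/ray/tls/tls.crt   # only readable via the volume mount below
  - name: RAY_TLS_SERVER_KEY
    value: /etc/ray/tls/tls.key
  - name: RAY_TLS_CA_CERT
    value: /etc/ray/tls/ca.crt
  volumeMounts:
  - name: ray-tls                 # the init container needs this mount too,
    mountPath: /etc/ray/tls       # hence the blanket VolumeMounts copy
volumes:
- name: ray-tls
  secret:
    secretName: ray-tls-secret    # hypothetical Secret holding the certificates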
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
I've just hit a similar problem with the wait-gcs-ready container injection. Does this really have to be done in a separate container? There's already logic in BuildPod() doing different things depending on rayNodeType, so why not just set the command in there for a worker to be:
until ray health-check --address ${RAY_IP}:${RAY_PORT} > /dev/null 2>&1; do echo wait for GCS to be ready; sleep 5; done && ray start ...
You can check https://github.com/ray-project/kuberay/pull/1069 for a workaround.
Basically, you can set ENABLE_INIT_CONTAINER_INJECTION to false to avoid the default init container and set your own init container.
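For this repro, that workaround would look roughly like the following (the env var name comes from the PR above; how it is wired into the operator and the replacement init container are assumptions about your setup): set ENABLE_INIT_CONTAINER_INJECTION to "false" in the kuberay-operator container's environment, then add your own init container to the worker template, deliberately omitting the fuse volumeMount. The command is adapted from the health-check loop quoted earlier in this thread:

initContainers:
# Hand-rolled replacement for the injected wait-gcs-ready container.
# It mounts only ray-logs, never model-storage, so it cannot deadlock
# on the gke-gcsfuse-sidecar.
- name: wait-gcs-ready
  image: rayproject/ray:2.4.0
  command: ['sh', '-c', 'until ray health-check --address $RAY_IP:6379 > /dev/null 2>&1; do echo waiting for GCS; sleep 5; done']
  volumeMounts:
  - mountPath: /tmp/ray
    name: ray-logs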
Reopen this issue. I will check whether we should make the GKE CSI Fuse work with the default KubeRay config, or if updating the documentation is sufficient.
@kevin85421 I'm curious about the status of this issue. We hit the same issue and wonder if it can work with the default KubeRay config. Disabling auto init container injection would be quite a big breaking change on our platform. I wonder if we can approach it with a more elegant solution. Thank you.
Is the primary reason for volume mounting to read TLS certificates? Copying all volume mounts does seem unnecessary, but I'm not sure how we would check which ones are actually needed by the init container.
cc @msau42
Disabling auto init container injection would be quite a big breaking change on our platform.
cc @daikeshi Would you mind sharing more details about this?
cc @andrewsykim is there any way to check whether the head service has more than 0 endpoints from within a worker Pod, without any RBAC? In KubeRay v1.1.0, the head Pod will always have a readiness probe to check the status of the GCS. Hence, if the head service has more than 0 endpoints (the head service should only have 0 or 1 endpoints), it means that the GCS is ready. See #1674 for more details. If that is possible, the init container doesn't need to communicate with the head Pod.
cc @daikeshi Would you mind sharing more details about this?
@kevin85421 Yeah, it's specific to our setup. Since we have a Python SDK that interacts with the KubeRay k8s API to create Ray clusters, if we disable the auto init container injection on the server side, the existing SDK and users' YAML will no longer work. They would either need to use our updated SDK or update their YAML files for Ray cluster creation.
The GKE GCS FUSE CSI team is working on adopting the Kubernetes native sidecar container feature. We are targeting mid-March to make the feature available in GKE 1.29 clusters.
Let me clarify the previous comment.
The root cause of this issue is that the GCS FUSE CSI driver currently does not support volumes for init containers. This is because we run the GCSFuse binary in a sidecar container, and for now that sidecar is a regular container.
After the ~~mid-March~~ (the new ETA is 3/29/2024) GKE release, we will start to run the GCSFuse binary in a Kubernetes native sidecar container, which is also an init container. This means that, with the new release, you can mount a GCSFuse volume in your init containers. This new feature will fix this issue and streamline GCS data usage on KubeRay.
The new GKE version rollout has completed. Starting from GKE 1.29.3-gke.1093000, the CSI driver injects the GCSFuse sidecar container as an init container, which also supports mounting GCSFuse volumes in other init containers.
To try out the new feature, please upgrade your GKE cluster to 1.29.3-gke.1093000 or later, make sure ALL your nodes are also upgraded to GKE version 1.29 or later, and then re-deploy your workloads.
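For example, after the upgrade an init container in the worker template of the repro above should be able to mount the fuse volume directly, roughly like this (illustrative sketch only; the command is a placeholder):

initContainers:
- name: init
  image: busybox:1.28
  # With the native-sidecar-based CSI driver, the gcsfuse sidecar runs as an
  # init container itself, so this mount is already served during initialization.
  command: ['sh', '-c', 'ls /fuse']
  volumeMounts:
  - name: model-storage
    mountPath: /fuse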
@songjiaxun thanks for the update! I'll try it out.