[Docker] (Kubeflow integration) Only UID=1000 has the write access of /home/ray in Ray images

Open kevin85421 opened this issue 3 years ago • 9 comments

Description

This issue is a part of the integration between KubeRay and Kubeflow. See https://github.com/ray-project/kuberay/pull/750#issuecomment-1329856912 for some context.

In a KinD cluster (KinD is a Kubernetes distribution), when we use the command kubectl exec ... to log in to a Ray Pod, the UID is the same as $RAY_UID (i.e. 1000), which is set in base-deps/Dockerfile.

https://github.com/ray-project/ray/blob/b15d8f3f06598dd8b5dc77206c56e0b029006ddd/docker/base-deps/Dockerfile#L16-L27

On OpenShift (a Kubernetes distribution), a random non-root UID is used when we log in to a Ray Pod. However, only UID=1000 has write access to /home/ray, so a Permission denied error is reported. As shown below, only ray (UID = 1000) has rwx (read, write, execute) access to /home/ray; everyone else has only r-x (read and execute) access.

> ls -l /home/
drwxr-xr-x 1 ray users 4096 Dec  7 17:18 ray
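
As a minimal sketch of how to read that listing, the snippet below decodes the permission bits with Python's standard stat module. The helper name who_can_write and the constant HOME_RAY_MODE are hypothetical and introduced only for illustration; 0o755 is simply the octal form of the rwxr-xr-x bits shown above.

```python
import stat


def who_can_write(mode: int) -> dict:
    """Report which permission classes have the write bit set in `mode`."""
    return {
        "user": bool(mode & stat.S_IWUSR),
        "group": bool(mode & stat.S_IWGRP),
        "others": bool(mode & stat.S_IWOTH),
    }


# Directory flag plus 0o755, i.e. the rwxr-xr-x bits listed for /home/ray.
HOME_RAY_MODE = stat.S_IFDIR | 0o755

print(stat.filemode(HOME_RAY_MODE))   # symbolic form, as `ls -l` would print it
print(who_can_write(HOME_RAY_MODE))   # only the owning user has the write bit
```

Only the "user" class has the write bit, which is why a process running under any UID other than the owner's cannot create files in the directory.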

To reproduce it, follow the instructions in pod-security.md and add runAsUser and runAsGroup to the securityContext of ray-head in ray-cluster.pod-security.yaml.

securityContext:
  runAsUser: 1001190000
  runAsGroup: 0
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  runAsNonRoot: true
  seccompProfile:
    type: RuntimeDefault

After the RayCluster is ready, use kubectl exec ... to log in to the head Pod. Next, execute the following commands.

(base) I have no name!@raycluster-pod-security-head-nlfmj:~$ id
uid=1001190000 gid=0(root) groups=0(root)
(base) I have no name!@raycluster-pod-security-head-nlfmj:~$ pwd
/home/ray
(base) I have no name!@raycluster-pod-security-head-nlfmj:~$ touch 123
touch: cannot touch '123': Permission denied
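
The same check can be sketched in Python, which may be handy when scripting the reproduction. This is only an illustration: describe_access is a hypothetical helper, and it relies on the standard os.getuid, os.getgid, and os.access calls. Run inside the head Pod from /home/ray, it would be expected to report the arbitrary UID and a non-writable directory, matching the touch failure above.

```python
import os


def describe_access(path: str) -> dict:
    """Return the current UID/GID and whether `path` allows creating entries."""
    return {
        "uid": os.getuid(),
        "gid": os.getgid(),
        # Creating a file in a directory needs both write and search (execute) permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Inspect the current working directory, e.g. /home/ray inside the Pod.
    print(describe_access(os.getcwd()))
```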

Use case

This is a part of the integration between KubeRay and Kubeflow. See https://github.com/ray-project/kuberay/pull/750 for some context.

kevin85421 avatar Dec 08 '22 01:12 kevin85421

cc @DmitriGekhtman @juliusvonkohout

kevin85421 avatar Dec 08 '22 01:12 kevin85421

@DmitriGekhtman can you help tag the owner of Dockerfiles? I cannot find the owners of ray/docker in CODEOWNERS. If the owners do not have time to work on this issue, I can work on it. Thank you!

kevin85421 avatar Dec 08 '22 01:12 kevin85421

The Dockerfiles are the arcane work of @ijrsvt :D I imagine he might be a bit busy, but I'm sure he can answer any questions that come up.

DmitriGekhtman avatar Dec 08 '22 06:12 DmitriGekhtman

@kevin85421 What exactly do you want to change with the base Docker images?

ijrsvt avatar Dec 08 '22 17:12 ijrsvt

Thanks, @ijrsvt, for your reply! As mentioned in the PR description, we want to grant write access to /home/ray to other UIDs (!= 1000). I am not a security expert, so I am not sure whether this change will have any negative impact on security.

kevin85421 avatar Dec 09 '22 00:12 kevin85421

Ahh! Totally missed that. Can we change the permissions to 775 (User: RWX, Group: RWX, Others: R-X)?

ijrsvt avatar Dec 09 '22 16:12 ijrsvt

"During the creation of a project or namespace, OpenShift assigns a User ID (UID) range, a supplemental group ID (GID) range, and unique SELinux MCS labels to the project or namespace. ... When a Pod is deployed into the namespace, by default, OpenShift will use the first UID and first GID from this range to run the Pod. Any attempt by a Pod definition to specify a UID outside the assigned range will fail and requires special privileges." (A Guide to OpenShift and UIDs, Red Hat)

775 will not work because we cannot assume OpenShift will choose RAY_GID (i.e. 100) as its GID. I think we should use 777, but I am not sure whether 777 raises any security concerns.
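
To make the comparison concrete, here is a rough sketch of the classic POSIX permission check applied to the three candidate modes. The function can_write and the constant names are hypothetical; the owner (UID 1000, GID 100) and the Pod identity (UID 1001190000, GID 0) are the values quoted in this issue. The model is deliberately simplified: it ignores root privileges, capabilities, ACLs, and SELinux.

```python
def can_write(mode, owner_uid, owner_gid, uid, gids):
    """Simplified POSIX check: pick exactly one permission class, then test its write bit."""
    if uid == owner_uid:
        return bool(mode & 0o200)   # owner class
    if owner_gid in gids:
        return bool(mode & 0o020)   # group class
    return bool(mode & 0o002)       # others class


OWNER_UID, OWNER_GID = 1000, 100        # ray:users, as described in the issue
POD_UID, POD_GIDS = 1001190000, {0}     # from the `id` output in the issue

results = {
    oct(mode): can_write(mode, OWNER_UID, OWNER_GID, POD_UID, POD_GIDS)
    for mode in (0o755, 0o775, 0o777)
}
print(results)
```

Under this model, 775 would only help if the directory's group matched one of the Pod's GIDs, which is exactly what cannot be assumed here; of the three modes, only 777 grants write access to the arbitrary UID.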

kevin85421 avatar Dec 10 '22 23:12 kevin85421

Amazing, guys. I was busy with personal stuff, but I will catch up again. Write me on Slack if you need help.

juliusvonkohout avatar Dec 20 '22 17:12 juliusvonkohout

@kevin85421 @DmitriGekhtman

OpenShift uses GID 0 by default, so definitely not 100. Yes, 777 is the right way to support all Kubernetes distributions. Everything in your container is executed as UID 1000 so far, I guess, so I do not see where the security problem is. For virtual machines with varying users and processes this could be relevant, but not for properly written OCI containers with a single user. This is relevant not only for OpenShift but also for other clusters with enterprise-level security.

Is there anything else preventing you from moving forward?

juliusvonkohout avatar Jan 04 '23 12:01 juliusvonkohout

Gentle ping @ijrsvt. Thank you!

kevin85421 avatar Jan 05 '23 15:01 kevin85421

@kevin85421 Yeah, I think we can modify it to be 777

ijrsvt avatar Jan 05 '23 19:01 ijrsvt

777 is too open for ssh, so #31563 may be reverted. See #32025 for more details. Any ideas for other solutions? cc @juliusvonkohout @ijrsvt
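
For illustration only: this comment does not spell out the failure, and #32025 has the details. One plausible reading, stated here as an assumption, is that an SSH server with strict mode checking refuses logins when the user's home directory is writable by group or others. The hypothetical helper below sketches that kind of test; it is not taken from the Ray or OpenSSH sources.

```python
def group_or_world_writable(mode: int) -> bool:
    """True if the group or others write bit is set (the 0o022 mask)."""
    return bool(mode & 0o022)


for mode in (0o755, 0o775, 0o777):
    # Assumption: a strict SSH server would reject home directories where this is True.
    print(oct(mode), group_or_world_writable(mode))
```

If that reading is right, both 775 and 777 would trip the check, while the original 755 would not, which is consistent with the change possibly being reverted.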

kevin85421 avatar Jan 28 '23 07:01 kevin85421

We decided to integrate Kubeflow without this update. (https://github.com/kubeflow/manifests/pull/2383)

kevin85421 avatar Feb 28 '23 19:02 kevin85421