
ch-fromhost --nvidia, security.selinux fails, 0.24?

Open hpcpony opened this issue 3 years ago • 5 comments

charliecloud 0.24, anaconda3 2021.05 (python 3.8.8), CentOS 7.6 (non-root user).

I haven't found a comprehensive guide for working with NVIDIA GPUs and charliecloud, but I've managed to piece together enough to get this far. This one has me stumped. Any idea what I've missed?

    [hpcpony@ugpu1 hello]$ ch-fromhost --nvidia /scr/hpcpony/IMG/img/hello
    /sbin/ldconfig: Can't stat /libx32: No such file or directory
    /sbin/ldconfig: Can't stat /usr/libx32: No such file or directory
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted
    cp: setting attribute ‘security.selinux’ for ‘security.selinux’: Operation not permitted

Not sure what else is important, but ....

    [hpcpony@ugpu1 hello]$ nvidia-container-cli list
    /dev/nvidiactl
    /dev/nvidia-uvm
    /dev/nvidia-uvm-tools
    /dev/nvidia-modeset
    /dev/nvidia0
    /dev/nvidia1
    /dev/nvidia2
    /dev/nvidia3
    /usr/bin/nvidia-smi
    /usr/bin/nvidia-debugdump
    /usr/bin/nvidia-persistenced
    /usr/bin/nvidia-cuda-mps-control
    /usr/bin/nvidia-cuda-mps-server
    /usr/lib64/libnvidia-ml.so.450.80.02
    /usr/lib64/libnvidia-cfg.so.450.80.02
    /usr/lib64/libcuda.so.450.80.02
    /usr/lib64/libnvidia-opencl.so.450.80.02
    /usr/lib64/libnvidia-ptxjitcompiler.so.450.80.02
    /usr/lib64/libnvidia-allocator.so.450.80.02
    /usr/lib64/libnvidia-compiler.so.450.80.02
    /usr/lib64/vdpau/libvdpau_nvidia.so.450.80.02
    /usr/lib64/libnvidia-encode.so.450.80.02
    /usr/lib64/libnvidia-opticalflow.so.450.80.02
    /usr/lib64/libnvcuvid.so.450.80.02
    /usr/lib64/libnvidia-fbc.so.450.80.02
    /usr/lib64/libnvidia-ifr.so.450.80.02
    /usr/lib/libnvidia-ml.so.450.80.02
    /usr/lib/libcuda.so.450.80.02
    /usr/lib/libnvidia-opencl.so.450.80.02
    /usr/lib/libnvidia-ptxjitcompiler.so.450.80.02
    /usr/lib/libnvidia-allocator.so.450.80.02
    /usr/lib/libnvidia-compiler.so.450.80.02
    /usr/lib/vdpau/libvdpau_nvidia.so.450.80.02
    /usr/lib/libnvidia-encode.so.450.80.02
    /usr/lib/libnvidia-opticalflow.so.450.80.02
    /usr/lib/libnvcuvid.so.450.80.02
    /usr/lib/libnvidia-fbc.so.450.80.02
    /usr/lib/libnvidia-ifr.so.450.80.02

My Dockerfile that built the image is just a toy:

    FROM centos:8
    COPY . hello
    RUN hello/hello.sh

hpcpony • Jul 13 '21 16:07

Hi, thanks for the report! I don't know much about SELinux either, but can you answer a couple of questions to get us started:

  1. What is your SELinux configuration?
  2. What are the xattrs on the files listed by nvidia-container-cli list?

FWIW, if you are able to change the SELinux enforcement policy to just warn instead of enforce, it will probably work, I think.
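For the first question, a minimal sketch of one way to report the SELinux mode from a shell. The function name `selinux_mode` is made up for illustration; it assumes that `getenforce` is used when it is installed, and that otherwise the `enforce` file under `/sys/fs/selinux` (present only when SELinux is active) can be read.

```shell
#!/bin/sh
# selinux_mode is a hypothetical helper name, not part of charliecloud.
# Prints one of: enforcing, permissive, disabled.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        # getenforce prints Enforcing, Permissive, or Disabled.
        getenforce | tr '[:upper:]' '[:lower:]'
    elif [ -r /sys/fs/selinux/enforce ]; then
        # selinuxfs is mounted: 1 means enforcing, 0 means permissive.
        if [ "$(cat /sys/fs/selinux/enforce)" = "1" ]; then
            echo enforcing
        else
            echo permissive
        fi
    else
        # No tool and no selinuxfs: treat SELinux as disabled.
        echo disabled
    fi
}

selinux_mode
```

Switching from enforcing to permissive normally requires root on the host, so a non-root user would likely need an administrator's help for that step.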

reidpr • Jul 13 '21 17:07

Inside the container I don't see any SELinux commands or things that suggest SELinux running (but then I know very little about SELinux):

    sh-4.4$ rpm -aq | grep -i selinux
    libselinux-2.9-4.el8_3.x86_64
    sh-4.4$ rpm -ql libselinux-2.9-4.el8_3.x86_64
    /run/setrans
    /usr/lib/.build-id
    /usr/lib/.build-id/f5
    /usr/lib/.build-id/f5/9797148b51da3c41370527218252e673630c23
    /usr/lib/tmpfiles.d/libselinux.conf
    /usr/lib64/libselinux.so.1
    /usr/share/licenses/libselinux
    /usr/share/licenses/libselinux/LICENSE
    sh-4.4$ cat /usr/lib/tmpfiles.d/libselinux.conf
    d /run/setrans 0755 root root

On the host:

    [hpcpony@ugpu1 hello]$ sestatus
    SELinux status: disabled

Not sure if this is what you want. (If not, be more specific and I'll check it.)

    [hpcpony@ugpu1 hello]$ foreach f (`nvidia-container-cli list`)
    foreach? lsattr $f
    foreach? end
    lsattr: Operation not supported While reading flags on /dev/nvidiactl
    lsattr: Operation not supported While reading flags on /dev/nvidia-uvm
    lsattr: Operation not supported While reading flags on /dev/nvidia-uvm-tools
    lsattr: No such file or directory while trying to stat /dev/nvidia-modeset
    lsattr: Operation not supported While reading flags on /dev/nvidia0
    lsattr: Operation not supported While reading flags on /dev/nvidia1
    lsattr: Operation not supported While reading flags on /dev/nvidia2
    lsattr: Operation not supported While reading flags on /dev/nvidia3
    ---------------- /usr/bin/nvidia-smi
    ---------------- /usr/bin/nvidia-debugdump
    ---------------- /usr/bin/nvidia-persistenced
    ---------------- /usr/bin/nvidia-cuda-mps-control
    ---------------- /usr/bin/nvidia-cuda-mps-server
    ---------------- /usr/lib64/libnvidia-ml.so.450.80.02
    ---------------- /usr/lib64/libnvidia-cfg.so.450.80.02
    ---------------- /usr/lib64/libcuda.so.450.80.02
    ---------------- /usr/lib64/libnvidia-opencl.so.450.80.02
    ---------------- /usr/lib64/libnvidia-ptxjitcompiler.so.450.80.02
    ---------------- /usr/lib64/libnvidia-allocator.so.450.80.02
    ---------------- /usr/lib64/libnvidia-compiler.so.450.80.02
    ---------------- /usr/lib64/vdpau/libvdpau_nvidia.so.450.80.02
    ---------------- /usr/lib64/libnvidia-encode.so.450.80.02
    ---------------- /usr/lib64/libnvidia-opticalflow.so.450.80.02
    ---------------- /usr/lib64/libnvcuvid.so.450.80.02
    ---------------- /usr/lib64/libnvidia-fbc.so.450.80.02
    ---------------- /usr/lib64/libnvidia-ifr.so.450.80.02
    ---------------- /usr/lib/libnvidia-ml.so.450.80.02
    ---------------- /usr/lib/libcuda.so.450.80.02
    ---------------- /usr/lib/libnvidia-opencl.so.450.80.02
    ---------------- /usr/lib/libnvidia-ptxjitcompiler.so.450.80.02
    ---------------- /usr/lib/libnvidia-allocator.so.450.80.02
    ---------------- /usr/lib/libnvidia-compiler.so.450.80.02
    ---------------- /usr/lib/vdpau/libvdpau_nvidia.so.450.80.02
    ---------------- /usr/lib/libnvidia-encode.so.450.80.02
    ---------------- /usr/lib/libnvidia-opticalflow.so.450.80.02
    ---------------- /usr/lib/libnvcuvid.so.450.80.02
    ---------------- /usr/lib/libnvidia-fbc.so.450.80.02
    ---------------- /usr/lib/libnvidia-ifr.so.450.80.02
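Note that `lsattr` reports file flags (the `chattr` kind), which are not the same thing as extended attributes such as `security.selinux`. A rough sketch of listing the xattrs instead, assuming `getfattr` from the attr package is installed on the host; `show_xattrs` is a made-up helper name, not something from charliecloud.

```shell
#!/bin/sh
# show_xattrs is a hypothetical helper: reads one path per line on stdin
# and dumps every extended attribute found on each regular file.
show_xattrs() {
    if ! command -v getfattr >/dev/null 2>&1; then
        echo "getfattr not found (usually in the attr package)" >&2
        return 1
    fi
    while IFS= read -r path; do
        # Skip device nodes and paths that do not exist.
        [ -f "$path" ] || continue
        # --match=- selects all namespaces, including security.*
        getfattr --dump --match=- -- "$path" 2>/dev/null
    done
}

# Intended use on the GPU host (not run here):
#   nvidia-container-cli list | show_xattrs
```

If the output includes a `security.selinux` entry for the libraries, that would be the attribute `cp` is trying, and failing, to set inside the image.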

hpcpony • Jul 13 '21 17:07

I've been poking around....

It appears if I modify ch-fromhost:

    if [ ! -w "${image}${d}" ]; then
        # Some images unpack with unwriteable directories; fix. This seems
        # like a bit of a kludge to me, so I'd like to remove this special
        # case in the future if possible. (#323)
        info "${image}${d} not writeable; fixing"
        chmod u+w "${image}${d}" || fatal "can't chmod u+w: ${image}${d}"
    fi
       cp --dereference --preserve=all "$f" "${image}${d}" \
    || fatal "cannot inject: ${f}"

and modify the "cp" line...

       cp --dereference --preserve=all --no-preserve=xattr "$f" "${image}${d}" \

... the security.selinux errors go away. I do not know whether this is a good fix to this problem or not.
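For comparison, a small sketch of the same idea written with an explicit attribute list. In GNU `cp`, `--preserve=all` covers mode, ownership, timestamps, links, context, and xattr, so naming only the first four leaves out both xattrs and the SELinux context. The helper name `inject_file` is invented for this example and is not taken from `ch-fromhost`.

```shell
#!/bin/sh
# inject_file is a hypothetical helper, not the actual ch-fromhost code.
# Copies SRC into DEST_DIR, following symlinks, while preserving mode,
# ownership, timestamps, and hard links but not xattrs or SELinux context.
inject_file() {
    src=$1
    dest_dir=$2
    cp --dereference --preserve=mode,ownership,timestamps,links \
       -- "$src" "$dest_dir"
}
```

A call such as `inject_file /usr/bin/nvidia-smi "$image/usr/bin"` (with `$image` standing in for the image directory) would then copy the file without attempting to set `security.selinux`. This assumes GNU coreutils; other `cp` implementations may not accept these long options.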

The "libx32" errors persist.

hpcpony • Jul 13 '21 18:07

> The "libx32" errors persist.

These are harmless, fortunately; see issue #732.

reidpr • Jul 13 '21 18:07

It seems like ignoring the xattrs when performing the copy may be a path forward on this. Looking into this, I found a podman issue in which such errors are ignored in the underlying calls to shutil.move. @mej, do you have any input on this?

As a note, in our docs we say that we ignore SELinux.

heasterday • Aug 17 '21 22:08