Permission denied when trying to do fchdir to sandbox image in setuid-root mode
Version of Apptainer
apptainer version 1.1.8-1.el7
apptainer-suid-1.1.8-1.el7 is also installed
Expected behavior
Expected to be able to run the container.
Actual behavior
It fails with the error
ERROR : Failed to change current working directory: Permission denied
Steps to reproduce this behavior
We have only seen this happening when using mode 1 of cvmfsexec. That mode requires the fuse package to be installed to make fusermount available.
In a scratch directory on a local disk, preferably on an el7 machine, do the following steps:
- git clone https://github.com/cvmfs/cvmfsexec.git
- cd cvmfsexec
- ./makedist default
- ./mountrepo cvmfs-config.cern.ch
- ./mountrepo atlas.cern.ch
- apptainer shell $PWD/dist/cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7
That results in
ERROR : Failed to change current working directory: Permission denied
This failure doesn't happen when unprivileged user namespaces are available and apptainer is run with --userns (or when apptainer-suid is not installed).
In order to clean up the mounts, do
- ./umountrepo atlas.cern.ch
- ./umountrepo cvmfs-config.cern.ch
What OS/distro are you running
CentOS 7
How did you install Apptainer
From EPEL.
When run with debug mode it ends with:
VERBOSE [U=3382,P=2069950] wait_child() stage 1 exited with status 0
DEBUG [U=3382,P=2069950] init() Applying stage 1 working directory
ERROR [U=3382,P=2069950] init() Failed to change current working directory: Permission denied
I ran starter-suid with strace -f -s1024 and here's the end of the strace on the failing process:
2069950 open("/proc/self/fd", O_RDONLY) = 3
2069950 fstat(3, {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0
2069950 fcntl(3, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
2069950 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
2069950 getdents(3, /* 6 entries */, 32768) = 144
2069950 getdents(3, /* 0 entries */, 32768) = 0
2069950 lseek(3, 0, SEEK_SET) = 0
2069950 getdents(3, /* 6 entries */, 32768) = 144
2069950 getdents(3, /* 0 entries */, 32768) = 0
2069950 close(3) = 0
2069950 close(3) = -1 EBADF (Bad file descriptor)
2069950 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
2069950 clone(child_stack=0x555d7e8064d0, flags=CLONE_FILES|SIGCHLD) = 2069951
2069950 geteuid() = 3382
2069950 geteuid( <unfinished ...>
2069950 <... geteuid resumed>) = 3382
2069950 write(2, "DEBUG [U=3382,P=2069950] init() Wait completion of stage1\n", 84 <unfinished ...>
2069950 <... write resumed>) = 84
2069950 wait4(2069951, <unfinished ...>
2069950 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 2069951
2069950 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2069951, si_uid=3382, si_status=0, si_utime=1, si_stime=2} ---
2069950 geteuid() = 3382
2069950 geteuid() = 3382
2069950 write(2, "VERBOSE [U=3382,P=2069950] wait_child() stage 1 exited with status 0\n", 87) = 87
2069950 geteuid() = 3382
2069950 geteuid() = 3382
2069950 write(2, "DEBUG [U=3382,P=2069950] init() Applying stage 1 working directory\n", 93) = 93
2069950 fchdir(3) = -1 EACCES (Permission denied)
2069950 geteuid() = 3382
2069950 geteuid() = 3382
2069950 write(2, "\33[91mERROR [U=3382,P=2069950] init() Failed to change current working directory: Permission denied\n\33[0m", 129) = 129
2069950 exit_group(1) = ?
2069950 +++ exited with 1 +++
This is failing in starter.c, which calls fchdir on file descriptor 3 even though that file descriptor was closed a little earlier. I see the file descriptor being set in prepare_linux.go, but I don't understand how it is supposed to be shared with the C code.
@DrDaveD those lines
2069950 open("/proc/self/fd", O_RDONLY) = 3
2069950 fstat(3, {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0
2069950 fcntl(3, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
2069950 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
2069950 getdents(3, /* 6 entries */, 32768) = 144
2069950 getdents(3, /* 0 entries */, 32768) = 0
2069950 lseek(3, 0, SEEK_SET) = 0
2069950 getdents(3, /* 6 entries */, 32768) = 144
2069950 getdents(3, /* 0 entries */, 32768) = 0
2069950 close(3) = 0
2069950 close(3) = -1 EBADF (Bad file descriptor)
correspond to the call to list_fd made before spawning stage 1. If fd 3 were actually closed at the time of the fchdir, fchdir would return EBADF rather than EACCES.
The C code and the stage 1 Go code share the file descriptor table thanks to CLONE_FILES. This is equivalent to what threads do, but between two processes.
My guess is that the FUSE mount point doesn't have the allow_other option set, so apptainer in suid mode is not permitted to access the mount point and gets permission denied when using $PWD/dist/cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 as the current directory. Maybe we also have to adjust the filesystem uid/gid with setfsuid/setfsgid calls.
Thanks! You are correct: if I add user_allow_other to /etc/fuse.conf and add ,allow_other to the cvmfs2 -o list inside of mountrepo, it does not get Permission denied.
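For anyone who wants to check whether a given FUSE mount actually carries allow_other, a small sketch using the glibc mntent API follows. The function name `mount_allows_other` is made up here. It reads a mounts-format file (normally /proc/self/mounts) and looks for the option on the entry for a given mount point.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `mountpoint` is listed in
 * `mounts_file` with the allow_other option, 0 if it is listed without
 * it, and -1 if the file cannot be read or the mount point is absent. */
int mount_allows_other(const char *mounts_file, const char *mountpoint)
{
    FILE *fp = setmntent(mounts_file, "r");
    struct mntent *ent;
    int result = -1;

    if (fp == NULL)
        return -1;
    while ((ent = getmntent(fp)) != NULL) {
        /* Keep scanning: for stacked mounts the last entry is on top. */
        if (strcmp(ent->mnt_dir, mountpoint) == 0)
            result = hasmntopt(ent, "allow_other") != NULL;
    }
    endmntent(fp);
    return result;
}
```

A call such as `mount_allows_other("/proc/self/mounts", path)` with the absolute path of the dist/cvmfs/atlas.cern.ch mount point should return 1 only after the allow_other change described above.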
Is there anything that apptainer could do to avoid this without setting allow_other? Is it running as root at the time, and might it work to drop privileges before doing fchdir?
Oh but the strace shows that the effective uid is the user. That's probably what you meant about trying setfsuid/setfsgid. Might that make it work without setting allow_other?
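As a side note on the fsuid question, the current filesystem uid can be queried without changing it, because setfsuid returns the previous value and a call with -1 never succeeds in changing anything. A minimal sketch, with hypothetical helper names:

```c
#include <sys/fsuid.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the current filesystem uid. setfsuid()
 * always returns the previous fsuid, and -1 is never a valid target,
 * so this call only reads the value. */
uid_t current_fsuid(void)
{
    return (uid_t)setfsuid((uid_t)-1);
}

/* Hypothetical helper: 1 if fsuid already equals the effective uid. */
int fsuid_matches_euid(void)
{
    return current_fsuid() == geteuid();
}
```

Since the kernel updates fsuid whenever the effective uid changes, `fsuid_matches_euid()` is expected to return 1 unless something has called setfsuid explicitly.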
In the suid case, privileges are dropped, and fsuid/fsgid should already be set to the current user because the euid corresponds to the current user, so this is strange. You also mentioned:
We have only seen this happening when using mode 1
What does that mean technically? Is there a user namespace involved? Also, can you access the FUSE mount point without apptainer?
No user namespace is involved in cvmfsexec mode 1. It just uses fusermount. Yes, the FUSE mount point can be accessed on the host by any process belonging to the user who did the mount.
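One thing that may be worth checking, offered as an assumption and not something confirmed in this thread: as far as I understand the kernel's FUSE code, a mount without allow_other is only accessible to processes whose real, effective, and saved uids (and gids) all equal the mount owner's. A setuid-root starter that has lowered only its effective uid would still have a saved uid of 0. The sketch below reads the Uid line from a /proc/<pid>/status file so the three values can be compared. The struct and function names are hypothetical.

```c
#include <stdio.h>

/* The four values on the "Uid:" line of /proc/<pid>/status. */
struct proc_uids {
    unsigned real, effective, saved, fs;
};

/* Hypothetical helper: parse the Uid line of a status file.
 * Returns 0 on success, -1 if the file or the line is missing. */
int read_proc_uids(const char *status_path, struct proc_uids *out)
{
    char line[256];
    FILE *fp = fopen(status_path, "r");
    int found = -1;

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (sscanf(line, "Uid: %u %u %u %u", &out->real,
                   &out->effective, &out->saved, &out->fs) == 4) {
            found = 0;
            break;
        }
    }
    fclose(fp);
    return found;
}

/* Assumed FUSE rule (see the note above): real, effective, and saved
 * uid must all equal the uid that owns the mount. */
int uids_match_mount_owner(const struct proc_uids *u, unsigned owner)
{
    return u->real == owner && u->effective == owner && u->saved == owner;
}
```

Running `read_proc_uids` against the status file of the failing starter-suid process, for example while it is stopped under a debugger, would show whether its saved uid still differs from the user's at the point of the fchdir.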