Permission denied when trying to do fchdir to sandbox image in setuid-root mode

Open DrDaveD opened this issue 2 years ago • 6 comments

Version of Apptainer

apptainer version 1.1.8-1.el7

apptainer-suid-1.1.8-1.el7 is also installed

Expected behavior

Expect to be able to run container.

Actual behavior

Failed with error

ERROR  : Failed to change current working directory: Permission denied

Steps to reproduce this behavior

We have only seen this happen when using cvmfsexec mode 1, which requires the fuse package to be installed so that fusermount is available.

In a scratch directory on a local disk, preferably on an el7 machine, do the following steps:

  1. git clone https://github.com/cvmfs/cvmfsexec.git
  2. cd cvmfsexec
  3. ./makedist default
  4. ./mountrepo cvmfs-config.cern.ch
  5. ./mountrepo atlas.cern.ch
  6. apptainer shell $PWD/dist/cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7

That results in

ERROR  : Failed to change current working directory: Permission denied

This failure doesn't happen when unprivileged user namespaces are available and apptainer is run with --userns (or when apptainer-suid is not installed).

In order to clean up the mounts, do

  1. ./umountrepo atlas.cern.ch
  2. ./umountrepo cvmfs-config.cern.ch

What OS/distro are you running

CentOS 7

How did you install Apptainer

From EPEL.

DrDaveD, May 16 '23 16:05

When run with debug mode it ends with:

VERBOSE [U=3382,P=2069950]  wait_child()                  stage 1 exited with status 0
DEBUG   [U=3382,P=2069950]  init()                        Applying stage 1 working directory
ERROR   [U=3382,P=2069950]  init()                        Failed to change current working directory: Permission denied

I ran starter-suid with strace -f -s1024 and here's the end of the strace on the failing process:

2069950 open("/proc/self/fd", O_RDONLY) = 3
2069950 fstat(3, {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0
2069950 fcntl(3, F_GETFL)               = 0x8000 (flags O_RDONLY|O_LARGEFILE)
2069950 fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
2069950 getdents(3, /* 6 entries */, 32768) = 144
2069950 getdents(3, /* 0 entries */, 32768) = 0
2069950 lseek(3, 0, SEEK_SET)           = 0
2069950 getdents(3, /* 6 entries */, 32768) = 144
2069950 getdents(3, /* 0 entries */, 32768) = 0
2069950 close(3)                        = 0
2069950 close(3)                        = -1 EBADF (Bad file descriptor)
2069950 rt_sigprocmask(SIG_BLOCK, NULL, [], 8) = 0
2069950 clone(child_stack=0x555d7e8064d0, flags=CLONE_FILES|SIGCHLD) = 2069951
2069950 geteuid()                       = 3382
2069950 geteuid( <unfinished ...>
2069950 <... geteuid resumed>)          = 3382
2069950 write(2, "DEBUG   [U=3382,P=2069950]  init()                        Wait completion of stage1\n", 84 <unfinished ...>
2069950 <... write resumed>)            = 84
2069950 wait4(2069951,  <unfinished ...>
2069950 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 2069951
2069950 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2069951, si_uid=3382, si_status=0, si_utime=1, si_stime=2} ---
2069950 geteuid()                       = 3382
2069950 geteuid()                       = 3382
2069950 write(2, "VERBOSE [U=3382,P=2069950]  wait_child()                  stage 1 exited with status 0\n", 87) = 87
2069950 geteuid()                       = 3382
2069950 geteuid()                       = 3382
2069950 write(2, "DEBUG   [U=3382,P=2069950]  init()                        Applying stage 1 working directory\n", 93) = 93
2069950 fchdir(3)                       = -1 EACCES (Permission denied)
2069950 geteuid()                       = 3382
2069950 geteuid()                       = 3382
2069950 write(2, "\33[91mERROR   [U=3382,P=2069950]  init()                        Failed to change current working directory: Permission denied\n\33[0m", 129) = 129
2069950 exit_group(1)                   = ?
2069950 +++ exited with 1 +++

The failure is in starter.c, which calls fchdir on file descriptor 3 even though that descriptor appears to have been closed a little earlier. I see the file descriptor being set in prepare_linux.go, but I don't understand how it is supposed to be shared with the C code.

DrDaveD, May 16 '23 16:05

@DrDaveD those lines

2069950 open("/proc/self/fd", O_RDONLY) = 3
2069950 fstat(3, {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0
2069950 fcntl(3, F_GETFL)               = 0x8000 (flags O_RDONLY|O_LARGEFILE)
2069950 fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
2069950 getdents(3, /* 6 entries */, 32768) = 144
2069950 getdents(3, /* 0 entries */, 32768) = 0
2069950 lseek(3, 0, SEEK_SET)           = 0
2069950 getdents(3, /* 6 entries */, 32768) = 144
2069950 getdents(3, /* 0 entries */, 32768) = 0
2069950 close(3)                        = 0
2069950 close(3)                        = -1 EBADF (Bad file descriptor)

correspond to the call to list_fd made before spawning stage 1. If fd 3 were closed, fchdir would return EBADF. The C code and the stage 1 Go code share the file descriptor table thanks to CLONE_FILES; this is equivalent to what threads do, but between two processes.
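The EBADF versus EACCES distinction above can be checked in isolation. The following is a minimal sketch, not Apptainer code: the helper names `try_fchdir` and `demo` are made up for illustration, and it uses a temporary directory rather than a FUSE mount, so it only shows that a closed descriptor yields EBADF and does not reproduce the EACCES case.

```python
import errno
import os
import tempfile


def try_fchdir(fd):
    """Attempt fchdir(fd); return None on success or the errno name on failure."""
    # Remember the current directory by path, not by opening a descriptor:
    # a new descriptor could reuse the number of a just-closed fd.
    saved = os.getcwd()
    try:
        os.fchdir(fd)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.chdir(saved)  # restore the original working directory
    return None


def demo():
    """Call fchdir on an open directory descriptor, then on the same number after close."""
    with tempfile.TemporaryDirectory() as tmp:
        fd = os.open(tmp, os.O_RDONLY | os.O_DIRECTORY)
        open_result = try_fchdir(fd)    # descriptor is valid here
        os.close(fd)
        closed_result = try_fchdir(fd)  # descriptor number is no longer valid
    return open_result, closed_result
```

Since the strace shows `EACCES` rather than `EBADF`, fd 3 was still a valid descriptor when `fchdir(3)` ran; the kernel refused access to the directory it refers to.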

My guess is that the FUSE mount point doesn't have the allow_other option set, so apptainer in suid mode is not permitted to access the mount point and gets permission denied when using $PWD/dist/cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 as the current directory. Maybe we also have to adjust the FS uid/gid with setfsuid/setfsgid calls.

cclerget, May 16 '23 17:05

Thanks! You are correct: if I add user_allow_other to /etc/fuse.conf and add ,allow_other to the cvmfs2 -o list inside of mountrepo, it does not get Permission denied.
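To confirm which state a machine is in, the two settings mentioned above can be read back from `/etc/fuse.conf` and `/proc/self/mounts`. This is a rough sketch under stated assumptions: the function names are hypothetical, the functions take the file contents as text so they can be tried on sample data, and the sample mount line in the usage note is invented for illustration rather than copied from a real system.

```python
def fuse_mount_options(mounts_text, mount_point):
    """Return the option list of the FUSE mount at mount_point, or None if absent.

    mounts_text is the content of /proc/self/mounts. Paths containing spaces
    are escaped there (for example as \\040), which this sketch does not decode.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _source, target, fstype, options = fields[:4]
        if target == mount_point and fstype.startswith("fuse"):
            return options.split(",")
    return None


def has_allow_other(mounts_text, mount_point):
    """True if the FUSE mount at mount_point was mounted with allow_other."""
    options = fuse_mount_options(mounts_text, mount_point)
    return options is not None and "allow_other" in options


def user_allow_other_enabled(fuse_conf_text):
    """True if fuse.conf text contains an uncommented user_allow_other line."""
    return any(
        line.strip() == "user_allow_other"
        for line in fuse_conf_text.splitlines()
    )
```

For example, with an invented line such as `cvmfs2 /scratch/dist/cvmfs/atlas.cern.ch fuse ro,nosuid,nodev,user_id=3382,group_id=3382 0 0`, `has_allow_other` returns `False` for that mount point; after remounting with `,allow_other` the option should appear in the list.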

Is there anything that apptainer could do to avoid this without setting allow_other? Is it running as root at the time, and might it work to drop privileges before doing fchdir?

DrDaveD, May 16 '23 18:05

Oh but the strace shows that the effective uid is the user. That's probably what you meant about trying setfsuid/setfsgid. Might that make it work without setting allow_other?

DrDaveD, May 16 '23 19:05

In the suid case, privileges are dropped and fsuid/fsgid should be set to the current user, since euid corresponds to the current user, so this is strange. You also mentioned:

We have only seen this happening when using mode 1

What does that mean technically? Is a user namespace involved? Also, can you access the FUSE mount point without apptainer?
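One way to check what the failing process is actually running as is to read the `Uid:` and `Gid:` lines of `/proc/<pid>/status`, which list the real, effective, saved and filesystem IDs in that order. The sketch below parses that text; the function names are hypothetical and the sample values in the usage note are invented. Whether a saved UID of 0 left over from setuid-root mode is what the FUSE check trips on is an assumption worth testing, not something established in this thread.

```python
def parse_ids(status_text):
    """Extract the Uid and Gid lines from /proc/<pid>/status text.

    Each line carries four numbers: real, effective, saved set, filesystem.
    """
    names = ("real", "effective", "saved", "filesystem")
    result = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("Uid", "Gid"):
            result[key] = dict(zip(names, (int(v) for v in rest.split())))
    return result


def ids_all_match(ids, uid, gid):
    """True if every UID equals uid and every GID equals gid."""
    return (
        all(value == uid for value in ids["Uid"].values())
        and all(value == gid for value in ids["Gid"].values())
    )
```

With invented sample text such as `Uid:\t3382\t3382\t0\t3382` and `Gid:\t3382\t3382\t3382\t3382`, `parse_ids` reports a saved UID of 0 and `ids_all_match(ids, 3382, 3382)` is `False`, even though the effective and filesystem UIDs are the user's.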

cclerget, May 17 '23 07:05

No user namespace is involved in cvmfsexec mode 1. It just uses fusermount. Yes, the FUSE mount point can be accessed on the host by any process of the person who did the mount.

DrDaveD, May 17 '23 15:05