Occasional failure to terminate fakeroot tempdir rm in e2e tests
Singularity 3.7.0
Within the e2e tests we have experienced occasional failures in termination of a singularity exec call that uses --fakeroot. This most often happens in the ImageVerify() function call made by the TestE2E/PAR/BUILD/from_local_image/Fakeroot/* tests.
I have not yet been able to replicate this issue outside of e2e test runs.
Most e2e runs do not get stuck. The runs that do get stuck do so at different places within ImageVerify(), and in different TestE2E/PAR/BUILD/from_local_image/Fakeroot/* sub-tests. However, the presentation is always the same, as below:
Symptoms
The process tree with the stuck process is always similar to:
├─sshd─┬─sshd───sshd───sh───python3───make───go-test───go─┬─sudo───e2e.test───e2e.test─┬─2*[cat]
│ │ │ ├─8*[gpg-agent]
│ │ │ ├─22*[gpg2]
│ │ │ ├─6*[gpgconf]
│ │ │ ├─2*[gpgsm]
│ │ │ ├─2*[sinit]
│ │ │ ├─starter─┬─sinit─┬─registry───7*[{registry}]
│ │ │ │ │ └─7*[{sinit}]
│ │ │ │ └─6*[{starter}]
│ │ │ ├─54*[starter]
│ │ │ ├─45*[starter-suid]
│ │ │ ├─starter-suid─┬─starter-suid─┬─rm
│ │ │ │ │ └─6*[{starter-suid}]
│ │ │ │ └─7*[{starter-suid}]
│ │ │ └─8*[{e2e.test}]
│ │ └─7*[{go}]
The ├─starter-suid─┬─starter-suid─┬─rm is the stuck chain. This rm was executed via the fakeroot engine to clean up the temporary container directory after the exec'd command completed successfully.
The rm process is always defunct, while the starter-suid processes are stuck on a FUTEX_WAIT_PRIVATE.
ec2-user 15392 0.0 0.4 1237708 18592 pts/0 Sl+ 15:58 0:00 Singularity runtime parent
ec2-user 15533 0.0 0.3 1162100 14388 pts/0 Sl+ 15:58 0:00 Singularity fakeroot
ec2-user@ip-172-31-39-218:~> sudo strace -p 15392
strace: Process 15392 attached
futex(0xaaaaabdc2d48, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 15392 detached
<detached ...>
ec2-user@ip-172-31-39-218:~> sudo strace -p 15533
strace: Process 15533 attached
futex(0xaaaacaaced48, FUTEX_WAIT_PRIVATE, 0, NULL^Cstrace: Process 15533 detached
<detached ...>
This issue has been observed across RHEL 7 / SLES 12 / SLES 15 / Ubuntu 18.04. It has not (yet) been observed on Ubuntu 20.04 or Fedora 33 environments, perhaps hinting at a cause that is mitigated in newer kernels?
Investigation / Thoughts
When Singularity runs a container in --fakeroot mode, loopback squashfs mounts are not possible, so any SIF image is extracted to a temporary directory before execution. Singularity sets up mounts and namespaces, and executes the container. There are typically a large number of underlay bind mounts from the temporary directory to the session directory.
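As a rough point of reference, a minimal sketch of the kind of underlay bind mount involved, assuming illustrative paths (this is not the actual Singularity mount code):

```go
package underlay

import "syscall"

// bindUnderlay bind-mounts a path from the extracted temporary rootfs over
// the corresponding path in the session directory. Paths and flags here are
// illustrative assumptions only.
func bindUnderlay(src, dst string) error {
	return syscall.Mount(src, dst, "", syscall.MS_BIND|syscall.MS_REC, "")
}
```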
When the run/exec/shell is complete, Singularity must remove the temporary directory. In fakeroot mode it is possible that this directory contains files with ownership/perms that cannot be removed by a standard user process due to the uid/gid mapping, etc.
Singularity calls /bin/rm -rf on the temporary directory via a fakeroot invocation through starter-suid, so that it can clean up any such files:
https://github.com/hpcng/singularity/blob/f806a5ed22627819d52846b21aa1ead86cc3199d/internal/pkg/runtime/engine/singularity/cleanup_linux.go#L59
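For illustration only, a rough sketch of what this nested cleanup amounts to; the helper path and argument layout below are hypothetical placeholders, and the real invocation builds an engine config and goes through starter-suid as in the code linked above:

```go
package cleanup

import (
	"os"
	"os/exec"
)

// removeWithFakeroot spawns /bin/rm -rf on the temporary directory through a
// fakeroot helper, so files owned by subuids/subgids (which the plain calling
// user cannot delete) are still removed. starterPath and the argument layout
// are hypothetical.
func removeWithFakeroot(starterPath, tmpDir string) error {
	cmd := exec.Command(starterPath, "/bin/rm", "-rf", tmpDir)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}
```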
At the point this /bin/rm -rf runs, there are still bind mounts from the temporary directory onto the session directory, etc. An explicit unmount of the container dirs is only done when an image driver is in use:
https://github.com/hpcng/singularity/blob/f806a5ed22627819d52846b21aa1ead86cc3199d/internal/pkg/runtime/engine/singularity/cleanup_linux.go#L44
My initial thought was that the hang might be related to issues with the directory being removed under these bind mounts / involvement of various mount namespaces etc. If I run only the imgbuild tests in a tight loop I have not yet seen failures, suggesting the issue is triggered by something happening in other prior / concurrent e2e tests.
I inserted code to MNT_DETACH the container rootfs etc. before the /bin/rm is run, so that in the fakeroot context for that rm there are no bind mounts from the temporary directory being removed. This at least seems to make the problem less frequent, but it still occurs. A limitation of this approach is that MNT_DETACH doesn't guarantee the mount is no longer in use / blocking things; it just removes it from the hierarchy, with full disposal deferred until no references remain.
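The experiment was essentially a lazy unmount ahead of the cleanup; a minimal sketch of that kind of call (not the exact patch):

```go
package cleanup

import "syscall"

// lazyUnmount detaches a mount point from the mount hierarchy (MNT_DETACH).
// The kernel only fully disposes of the mount once nothing references it any
// more, which is why this does not guarantee the underlying directory is free.
func lazyUnmount(target string) error {
	return syscall.Unmount(target, syscall.MNT_DETACH)
}
```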
Still looking into this - debugging is difficult as:
- It is an occasional failure
- I can't reproduce it outside of the e2e runs... so very difficult to strace or attach a debugger ahead of the point things get stuck
- The nested fakeroot invocation of /bin/rm does not write debug logs without code changes
Any thoughts from @cclerget @tri-adam @ikaneshiro would be most welcome.... I'm going to continue to pull at threads on this one.
Hello,
This is a templated response that is being sent out to all open issues. We are working hard on 'rebuilding' the Singularity community, and a major task on the agenda is finding out what issues are still outstanding.
Please consider the following:
- Is this issue a duplicate, or has it been fixed/implemented since being added?
- Is the issue still relevant to the current state of Singularity's functionality?
- Would you like to continue discussing this issue or feature request?
Thanks, Carter
This issue has been automatically marked as stale because it has not had activity in over 60 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because no response was provided within 7 days.
We are experiencing this issue very often on CentOS 8, with Singularity from the OpenHPC repository (3.7.1-5.1.ohpc.2.1). We have even tried to run Singularity inside tini in the hope it would reap the defunct processes, but it didn't work.
u726578 3081078 1 0 08:10 ? 00:00:00 bash /home/u726578/chaotic/toolbox/src/chaotic routine hourly.1
u726578 3081696 3081078 0 08:10 ? 00:00:00 \_ bash /home/u726578/chaotic/toolbox/src/chaotic routine hourly.1
u726578 3081697 3081696 0 08:10 ? 00:00:00 \_ bash /home/u726578/chaotic/toolbox/src/chaotic routine hourly.1
u726578 3081704 3081697 0 08:10 ? 00:00:00 | \_ tini -s -- singularity --silent exec --fakeroot -B /tmp/chaotic/routines/hourly.1/davinci-resolve-studio:/what-is-mine docker://registry.gitlab.com/jitesoft/dockerfiles/alpine chown -R 0:0 /what-is-mine
u726578 3081705 3081704 0 08:10 ? 00:00:00 | \_ Singularity runtime parent
u726578 3082275 3081705 0 08:10 ? 00:00:00 | \_ Singularity fakeroot
u726578 3082323 3082275 0 08:10 ? 00:00:00 | \_ [rm] <defunct>
u726578 3081698 3081696 0 08:10 ? 00:00:00 \_ tee -a davinci-resolve-studio.log
u726578 3106858 1 0 08:11 ? 00:00:00 bash /home/u726578/chaotic/toolbox/src/chaotic routine hourly.1
u726578 3107064 3106858 0 08:11 ? 00:00:00 \_ bash /home/u726578/chaotic/toolbox/src/chaotic routine hourly.1
u726578 3107065 3107064 0 08:11 ? 00:00:00 \_ bash /home/u726578/chaotic/toolbox/src/chaotic routine hourly.1
u726578 3107071 3107065 0 08:11 ? 00:00:00 | \_ tini -s -- singularity --silent exec --fakeroot -B /tmp/chaotic/routines/hourly.1/tok-git:/what-is-mine docker://registry.gitlab.com/jitesoft/dockerfiles/alpine chown -R 0:0 /what-is-mine
u726578 3107072 3107071 0 08:11 ? 00:00:00 | \_ Singularity runtime parent
u726578 3107328 3107072 0 08:11 ? 00:00:00 | \_ Singularity fakeroot
u726578 3107344 3107328 0 08:11 ? 00:00:00 | \_ [rm] <defunct>
u726578 3107066 3107064 0 08:11 ? 00:00:00 \_ tee -a tok-git.log
@thotypous Can you provide a simple recipe that would enable anyone to reproduce it on CentOS 8? Does it happen with EPEL singularity?
I don't have a reduced testcase, but it likely involves lots of concurrency. The application where this issue was showing up is a package builder. We unpack a clean rootfs to a directory, add a script containing a recipe on how to build some package, then start that sandbox directory as a singularity container (with fakeroot). We build 10 packages in parallel (in separate containers).
After some package is built, we need to collect the package file and remove the sandbox directory, but since it was run inside a fakeroot, it contains files owned by our user's subuid/subgid which we don't have permission to access. So what we used to do was start an Alpine container (from a Docker repository, which Singularity internally caches as squashfs) with fakeroot to recursively chown the files to our user.
After building around 1k packages, we can see 1~3 processes stuck with a defunct rm. This is clearly Singularity trying to remove the temporary sandbox it creates in order to run the squashfs image within a fakeroot.
We ended up solving this by using podman unshare to run chown, so that we don't need to rely on a squashfs-based Singularity container. It would be nice if Singularity provided a native equivalent to the podman unshare command (something similar is also provided by LXC, through lxc-usernsexec), but that's a feature request for another issue.
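For what it's worth, a minimal Go sketch of the user-namespace mechanism that podman unshare builds on. Note this maps only the caller's own uid/gid to root, whereas podman unshare also maps the subordinate ranges via newuidmap/newgidmap and so can actually reach files owned by subuids; the path is a placeholder:

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Run chown inside a new user namespace where the invoking user appears
	// as root. Only a single uid/gid is mapped here; a true `podman unshare`
	// equivalent would also map the subordinate uid/gid ranges.
	cmd := exec.Command("chown", "-R", "0:0", "/path/to/sandbox") // placeholder path
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER,
		UidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: os.Getuid(), Size: 1},
		},
		GidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: os.Getgid(), Size: 1},
		},
		GidMappingsEnableSetgroups: false,
	}
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}
```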
I can roll back that change and try to reproduce the issue with EPEL singularity when I get some time, but it won't be so simple, as we don't have a separate Slurm cluster to run that kind of breaking change.
Update to my original note...
This continues to occur occasionally for me in the e2e tests when using SLES 15 on a high core count machine. It is infrequent there, and is seen even more rarely on EL 7 / SLES 12 / Ubuntu 18.04. I have still not seen it myself under EL8 or Ubuntu 20.04. I have not seen it on ARM64 - only AMD64.
Copied to the new Apptainer issue list. https://github.com/apptainer/apptainer/issues/1155