No Loop Devices Error when using a large number of MPI processes
Version of Singularity: 3.7.2-3.el7
Problem Setup
I am attempting to run a singularity container with a large number of MPI processes (56 per node) on the Frontera supercomputer. This problem is similar or identical to #3928, which was marked resolved in v3.5 with PR #4549.
The command I'm using to run the job is:
ibrun singularity exec --bind $SCRATCH:$SCRATCH <singularity image> python <python script>
On the system, SHARED LOOP DEVICES is enabled and MAX LOOP DEVICES is set to 512.
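For reference, those settings live in the system-wide singularity.conf; if I have the option spellings right, the relevant lines look roughly like this (values as reported above):

```
# /etc/singularity/singularity.conf (relevant entries)
max loop devices = 512
shared loopback devices = yes
```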
Expected behavior
I expect MPI to launch 56 processes running singularity.
Actual behavior
I get the following error:
FATAL: container creation failed: mount /proc/self/fd/6->/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/6: failed to find loop device: could not attach image file to loop device: no loop devices available
This error is intermittent: it does not happen on every run, and it happens less often with fewer processes.
Steps to reproduce this behavior
Run singularity through a large number of MPI processes.
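For concreteness, here is a minimal sketch of the kind of batch script that triggers it (ibrun is TACC's MPI launcher; the job name, partition, image name, and script name below are placeholders, not my exact setup):

```bash
#!/bin/bash
#SBATCH -J loop-repro        # placeholder job name
#SBATCH -N 1                 # one node
#SBATCH -n 56                # 56 MPI tasks, one per core on a Frontera node
#SBATCH -p normal            # placeholder partition
#SBATCH -t 00:10:00

# Each of the 56 ranks starts its own singularity process, and each
# process mounts the container image on a loop device.
ibrun singularity exec --bind $SCRATCH:$SCRATCH image.sif python script.py
```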
What OS/distro are you running
CentOS Linux 7
I assigned this to Cedric, but he's still on vacation, so it could take a while. You might want to also report it to sylabs/singularity.
Thanks! I'll report it on the sylabs repo as well.
Hi @georgiastuart,
Can you post the --debug output of this issue, please?
Thank you! Greg
Here's the debug info! https://gist.github.com/georgiastuart/3172483873ab002728497e0f8d4e721f
Hi @georgiastuart, thank you.
I don't think this caught all of the debug output, as it didn't even get to the loop device setup. If there is just too much output, please grep the log output for loop and forward just those results.
Also, could you count how many loop devices are already set up, using the command ls -l /dev/loop* | wc -l?
@gmkurtzer sorry about that! It looks like the gist was too long to display, but if you view the raw file you can see the loop device lines. Regardless, I've isolated the loop lines here.
ls -l /dev/loop* | wc -l returns 1.
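For context, loop devices are block devices with major number 7, so an administrator could pre-create device nodes by hand if the kernel isn't allocating them dynamically. A hypothetical sketch (illustration only; requires root):

```bash
# Pre-create /dev/loop0 .. /dev/loop511 if they don't already exist.
for i in $(seq 0 511); do
  [ -e /dev/loop$i ] || mknod -m 0660 /dev/loop$i b 7 $i
done
```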
@georgiastuart What's the value returned by cat /sys/module/loop/parameters/max_loop on those nodes?
@cclerget sorry for the delay. That file does not exist on Frontera; I contacted support and confirmed that it is not available elsewhere. On Stampede 2, where I do not have this issue, cat /sys/module/loop/parameters/max_loop returns 0.
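For what it's worth, my understanding is that max_loop=0 (the kernel default) means only a handful of loop devices are created at module load, with further devices allocated on demand through /dev/loop-control; if the sysfs file is missing entirely, the parameter may simply not be exposed on that kernel. An admin could pin a larger static count with something like the following (assuming loop is built as a loadable module):

```bash
# Hypothetical admin-side configuration: load the loop module with a
# fixed number of devices on the next boot.
echo "options loop max_loop=512" > /etc/modprobe.d/loop.conf
# If loop is built into the kernel instead, the equivalent would be
# passing max_loop=512 on the kernel command line.
```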