No Loop Devices Error when using a large number of MPI processes

Open georgiastuart opened this issue 3 years ago • 8 comments

Version of Singularity:

3.7.2-3.el7

Problem Setup

I am attempting to run a Singularity container with a large number of MPI processes (56 per node) on the Frontera supercomputer. This problem is similar or identical to #3928, which was marked resolved in v3.5 with PR #4549.

The command I'm using to run the job is:

ibrun singularity exec --bind $SCRATCH:$SCRATCH <singularity image> python <python script>

On the system, SHARED LOOP DEVICES is enabled and MAX LOOP DEVICES is set to 512.
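For readers who want to check the same two settings on their own system, here is a minimal sketch. It assumes the configuration file is at /etc/singularity/singularity.conf (the path varies by installation) and that the directives are spelled `shared loop devices` and `max loop devices` in that file. `show_loop_settings` is a hypothetical helper name, not part of Singularity.

```shell
# Hypothetical helper: print the active loop-device directives
# from a Singularity configuration file.
# Assumption: the default path below; pass another path as $1 if needed.
show_loop_settings() {
    conf="${1:-/etc/singularity/singularity.conf}"
    if [ -r "$conf" ]; then
        # Match only uncommented directives such as "max loop devices = 512".
        grep -E '^[[:space:]]*(shared|max) loop devices[[:space:]]*=' "$conf"
    else
        echo "configuration file not readable: $conf" >&2
        return 1
    fi
}

show_loop_settings || true
```

Commented-out defaults are skipped on purpose, so an empty result suggests the directives are not set explicitly in that file.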

Expected behavior

I expect MPI to launch 56 processes running singularity.

Actual behavior

I get the following error:

FATAL:   container creation failed: mount /proc/self/fd/6->/var/singularity/mnt/session/rootfs error: while mounting image /proc/self/fd/6: failed to find loop device: could not attach image file to loop device: no loop devices available

The error is intermittent and occurs less often with fewer processes.

Steps to reproduce this behavior

Run singularity through a large number of MPI processes.

What OS/distro are you running

CentOS Linux 7

georgiastuart avatar Aug 25 '21 17:08 georgiastuart

I assigned this to Cedric, but he's still on vacation so it could take a while. You might want to also report it to sylabs/singularity.

DrDaveD avatar Aug 27 '21 18:08 DrDaveD

Thanks! I'll report it on the sylabs repo as well.

georgiastuart avatar Aug 28 '21 10:08 georgiastuart

Hi @georgiastuart,

Can you post the --debug output of this issue please?

Thank you! Greg

gmkurtzer avatar Aug 30 '21 03:08 gmkurtzer

Here's the debug info! https://gist.github.com/georgiastuart/3172483873ab002728497e0f8d4e721f

georgiastuart avatar Sep 02 '21 12:09 georgiastuart

Hi @georgiastuart, thank you.

I don't think this caught all of the debug output as it didn't even get to the loop device setups. If there is just too much output, please grep through the log output for loop and forward just those results.
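One way to isolate the loop-related lines is sketched below. It assumes the debug output was first saved to a file; `filter_loop_lines` and the log file name are hypothetical, and the capture command in the comments assumes debug messages are written to stderr.

```shell
# Hypothetical helper: keep only the lines of a saved debug log
# that mention "loop" (case-insensitive).
filter_loop_lines() {
    log_file="$1"
    grep -i 'loop' "$log_file"
}

# Example usage (image and script placeholders as in the original command):
#   ibrun singularity --debug exec --bind $SCRATCH:$SCRATCH <singularity image> python <python script> > singularity-debug.log 2>&1
#   filter_loop_lines singularity-debug.log
```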

Also, could you count how many loop devices are already set up, using the command ls -l /dev/loop* | wc -l?

gmkurtzer avatar Sep 03 '21 04:09 gmkurtzer

@gmkurtzer sorry about that! It looks like the gist was too long to display, but if you view the raw file you can see the loop device lines. Regardless, I've isolated the loop lines here.

ls -l /dev/loop* | wc -l returns 1.
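Note that the glob /dev/loop* also matches the control node /dev/loop-control where it exists, so a count of 1 does not necessarily mean that a numbered loop device is present. A minimal sketch that counts only the numbered nodes follows; `count_loop_nodes` is a hypothetical helper name.

```shell
# Hypothetical helper: count numbered loop device nodes (loop0, loop1, ...),
# excluding the loop-control node.
count_loop_nodes() {
    dev_dir="${1:-/dev}"
    n=0
    for f in "$dev_dir"/loop[0-9]*; do
        # If the glob matches nothing it stays literal, so check existence.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

count_loop_nodes
```

On systems with util-linux, `losetup -a` additionally lists the loop devices that currently have a backing file attached.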

georgiastuart avatar Sep 03 '21 12:09 georgiastuart

@georgiastuart What value does cat /sys/module/loop/parameters/max_loop return on those nodes?

cclerget avatar Sep 07 '21 11:09 cclerget

@cclerget sorry for the delay. That file does not exist on Frontera. I contacted support and confirmed that it is not elsewhere. On Stampede 2, where I do not have this issue, cat /sys/module/loop/parameters/max_loop returns 0.
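Because the parameter file is missing on one system and present on the other, a check that handles both cases may be useful when comparing nodes. The following is a minimal sketch; `report_max_loop` is a hypothetical helper name, and the thread does not establish why the file is absent on Frontera.

```shell
# Hypothetical helper: report the loop module's max_loop parameter,
# or state that the parameter file is missing.
report_max_loop() {
    param_file="${1:-/sys/module/loop/parameters/max_loop}"
    if [ -r "$param_file" ]; then
        echo "max_loop=$(cat "$param_file")"
    else
        echo "max_loop parameter file not found: $param_file"
    fi
}

report_max_loop
```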

georgiastuart avatar Sep 15 '21 20:09 georgiastuart