Results: 115 comments of Denis

Interesting indeed, but isn't it the case that if I restart the main `slurmd` itself, I also kill all the `slurmstepd` processes that might still be attached to it?

Using this submit script:

```
#!/bin/bash -l
#SBATCH -J t_pp
#SBATCH -p debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --constraint=amd,epyc
#SBATCH --time 00:30:00
#SBATCH -o %A_%a.out
#SBATCH -e %A_%a.err
#SBATCH --nodelist=lxbk0719
...
```

I am not sure I understood the config dir approach; how would I do that?

If you try to reproduce this problem, it will help us a lot to understand this strange behavior!

The simple.sh script is the following:

```
#!/bin/bash
./simple_pp
```

The simple_pp executable comes from the following MPI code:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, ...
```
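
For reference, a complete minimal reproducer along these lines could look like the sketch below; the body is an assumption extrapolated from the truncated snippet, not the actual `simple_pp` source:

```c
/* Hypothetical minimal MPI program in the spirit of simple_pp:
 * each task initializes MPI and reports its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```

It would typically be built with `mpicc` and then launched as `./simple_pp` from `simple.sh` inside the container.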

This is the Slurm version we use:

```
[dbertini@lxbk1130 ~]$ rpm -qa slurm*
slurm-slurmd-21.08.8-3~gsi1.el8.x86_64
slurm-libs-21.08.8-3~gsi1.el8.x86_64
slurm-singularity-exec-21.08-1.x86_64
slurm-21.08.8-3~gsi1.el8.x86_64
```

On the problematic node:

```
[dbertini@lxbk0719 ~]$ unshare -Ur
unshare: unshare failed: No space left on device
[dbertini@lxbk0719 ~]$ ps -e -o pid,userns,args | grep -v '^[ ]*[1-9][0-9]*[ ]*-'
  PID ...
```
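
The "No space left on device" message is simply how `unshare` reports `ENOSPC`. A small C sketch (an illustration of the failure mode, not part of the job itself) shows the same error when the kernel refuses to create another user namespace:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* The same system call that `unshare -U` performs. */
    if (unshare(CLONE_NEWUSER) != 0) {
        /* When a user-namespace limit is exhausted, errno is ENOSPC,
         * which strerror() renders as "No space left on device". */
        fprintf(stderr, "unshare(CLONE_NEWUSER) failed: %s (errno=%d)\n",
                strerror(errno), errno);
        return 1;
    }
    printf("user namespace created\n");
    return 0;
}
```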

@DrDaveD I will run without MPI at all to see if the problem still exists.

Indeed, but here in my case I guess we are not hitting the `32` nested-namespace limit but rather the maximum number of user namespaces created from the `root` namespace, i.e. the kernel...
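
If the suspicion is the per-user-namespace count limit rather than the 32-level nesting limit, the relevant kernel knob is `user.max_user_namespaces`, exposed as `/proc/sys/user/max_user_namespaces` (since Linux 4.9). A minimal sketch, assuming nothing beyond that proc file, to read the current value:

```c
#include <stdio.h>

int main(void) {
    /* Limit on how many user namespaces may exist in this
     * user-namespace hierarchy. */
    FILE *f = fopen("/proc/sys/user/max_user_namespaces", "r");
    if (!f) {
        perror("fopen /proc/sys/user/max_user_namespaces");
        return 1;
    }
    long limit;
    if (fscanf(f, "%ld", &limit) == 1)
        printf("user.max_user_namespaces = %ld\n", limit);
    fclose(f);
    return 0;
}
```

An administrator could raise the limit with, for example, `sysctl -w user.max_user_namespaces=<value>`.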

@DrDaveD I can confirm that, without MPI running inside the container, the problem disappears.