Slurm resource detection issue

Open · awlauria opened this issue 2 years ago

-np 144 jobs in last night's MTT with the latest PRRTE pointers are failing with:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 144
slots that were requested by the application:

  ./c_hello

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------

Link here to MTT results.

awlauria avatar Aug 09 '22 15:08 awlauria
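
As a first sanity check under Slurm, it can help to confirm what the allocation itself reports before involving PRRTE. A minimal sketch, assuming the commands run inside the salloc/sbatch allocation (these are standard Slurm variables and tools):

# What nodes and per-node CPU counts does Slurm think we have?
echo $SLURM_JOB_NODELIST
echo $SLURM_JOB_CPUS_PER_NODE
scontrol show job $SLURM_JOB_ID | grep -Ei 'NumNodes|NumCPUs'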

I would advise adding --display alloc to the cmd line to see what PRRTE thinks it was given.

rhc54 avatar Aug 09 '22 22:08 rhc54
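
For example, a minimal sketch of that suggestion applied to the failing command from the report:

mpirun -np 144 --display alloc ./c_hello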

Is somebody going to have a chance to look at this soon? I doubt the problem is with the Slurm allocation parser, as that hasn't changed, so it is likely a bug down in the mapper. I can try to take a look here, but it would help if somebody added --prtemca rmaps_base_verbose 5 to one of those runs and sent me the output (or posted it to a PRRTE issue so I can see it).

rhc54 avatar Aug 11 '22 13:08 rhc54
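
A sketch of the requested debug run, combining both suggested options and capturing the output to a file (the log file name is arbitrary):

mpirun -np 144 --display alloc --prtemca rmaps_base_verbose 5 ./c_hello 2>&1 | tee rmaps_debug.log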

@rhc54 I'll try to reproduce on my system.

janjust avatar Aug 11 '22 15:08 janjust

I think this was hit in our MTT; I can also take a shot at reproducing it.

wckzhang avatar Aug 11 '22 17:08 wckzhang

Any update on this?

awlauria avatar Aug 15 '22 19:08 awlauria

Guys - I hate to release PRRTE v3.0 with this unresolved, but I have tried everything at my disposal to reproduce it (testing under Slurm, faking RM allocations) with your cmd line, without any failures. Absent some input from your environment, I have no choice but to declare this an unverifiable anomaly and move forward with the release.

So please - can someone just produce the requested debug so we can address this?

rhc54 avatar Aug 19 '22 14:08 rhc54

I cannot reproduce this on my system either. This happened on an AWS system, so ideally it should be reproduced there.

janjust avatar Aug 19 '22 14:08 janjust

@wckzhang ???

gpaulsen avatar Aug 22 '22 16:08 gpaulsen

@wckzhang is on vacation. @shijin-aws can you take a look at this?

wzamazon avatar Aug 22 '22 16:08 wzamazon

will look at this today.

shijin-aws avatar Aug 22 '22 16:08 shijin-aws

I can reproduce this issue; it happens randomly. I am inside a salloc -n 144 and running the same test (alltoallv_somezeros) in a for loop, and the error is hit at random iteration numbers. In the log below, iterations 1, 2, and 3 failed; 4 and 5 succeeded.

(env) (env) bash-4.2$ for i in $(seq 1 5); do echo "iteration $i: mpirun -n 144 collective/alltoallv_somezeros"; /home/ec2-user/mtt-scratch/installs/wogy/install/bin/mpirun -n 144 /home/ec2-user/mtt-scratch/installs/wogy/tests/ibm/ibm/collective/alltoallv_somezeros; done
iteration 1: mpirun -n 144 collective/alltoallv_somezeros
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 144
slots that were requested by the application:

  /home/ec2-user/mtt-scratch/installs/wogy/tests/ibm/ibm/collective/alltoallv_somezeros

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
iteration 2: mpirun -n 144 collective/alltoallv_somezeros
[same "not enough slots" error as iteration 1]
iteration 3: mpirun -n 144 collective/alltoallv_somezeros
[same "not enough slots" error as iteration 1]
iteration 4: mpirun -n 144 collective/alltoallv_somezeros
No Errors
iteration 5: mpirun -n 144 collective/alltoallv_somezeros
No Errors

shijin-aws avatar Aug 22 '22 21:08 shijin-aws

Added --display alloc; the log is:

(env) (env) bash-4.2$ for i in $(seq 1 5); do echo "iteration $i: mpirun -n 144 collective/alltoallv_somezeros"; /home/ec2-user/mtt-scratch/installs/wogy/install/bin/mpirun -n 144 --display alloc /home/ec2-user/mtt-scratch/installs/wogy/tests/ibm/ibm/collective/alltoallv_somezeros; done
iteration 1: mpirun -n 144 collective/alltoallv_somezeros

======================   ALLOCATED NODES   ======================
    queue-c5n18xlarge-dy-c5n18xlarge-1: slots=36 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
	aliases: 172.31.2.155
    queue-c5n18xlarge-dy-c5n18xlarge-2: slots=36 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
	aliases: 172.31.9.252
    queue-c5n18xlarge-dy-c5n18xlarge-3: slots=36 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
	aliases: 172.31.15.79
    queue-c5n18xlarge-dy-c5n18xlarge-4: slots=36 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
	aliases: 172.31.5.159
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 144
slots that were requested by the application:

  /home/ec2-user/mtt-scratch/installs/wogy/tests/ibm/ibm/collective/alltoallv_somezeros

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
iteration 2: mpirun -n 144 collective/alltoallv_somezeros

[identical ALLOCATED NODES display and "not enough slots" error as iteration 1]
iteration 3: mpirun -n 144 collective/alltoallv_somezeros

[identical ALLOCATED NODES display and "not enough slots" error as iteration 1]
iteration 4: mpirun -n 144 collective/alltoallv_somezeros

[identical ALLOCATED NODES display and "not enough slots" error as iteration 1]
iteration 5: mpirun -n 144 collective/alltoallv_somezeros

[identical ALLOCATED NODES display and "not enough slots" error as iteration 1]

shijin-aws avatar Aug 22 '22 21:08 shijin-aws

@rhc54 I attached the log from a run with --display alloc --prtemca rmaps_base_verbose 5, as you suggested:

alltoallv_somezeros.txt

shijin-aws avatar Aug 22 '22 21:08 shijin-aws

Well, that last log shows me what is going on (two of the nodes are being dropped for some reason) - now I just have to figure out why! Might need some more debug, so I may be back.

rhc54 avatar Aug 23 '22 06:08 rhc54

Thanks @rhc54. Were you able to root-cause it?

awlauria avatar Aug 25 '22 13:08 awlauria

Haven't gotten there yet. What I saw in a quick scan is that two of the nodes were being skipped when we assemble the node list for mapping. I don't have any immediate idea as to why that happened. Once I get the remaining cmd line issues resolved (hopefully today), I plan to come back and look at this one. Probably have to add some verbose/debug code and ask for it to be re-run.

rhc54 avatar Aug 25 '22 17:08 rhc54

Crud - I'm dense. I don't need you to make more measurements - this is happening in the mapping phase. All I need is for someone to post the XML lstopo output from one of those nodes.

Can someone do that please?

rhc54 avatar Aug 25 '22 19:08 rhc54

@rhc54 I can do that - may I know what command I should run to get that?

shijin-aws avatar Aug 25 '22 19:08 shijin-aws

I believe it is simply lstopo --of xml > file

rhc54 avatar Aug 25 '22 20:08 rhc54
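
A sketch of collecting that from one of the allocated compute nodes rather than the login node (the output file name is arbitrary):

srun -N 1 -n 1 lstopo --of xml > node1-topo.xml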

lstopo.txt

shijin-aws avatar Aug 25 '22 22:08 shijin-aws

Rats - it works perfectly for me with that topology, so it has to be something else. Can you please update the openpmix and prrte submodules to head of their master branches, rebuild, and then rerun the cmd line with --display alloc --prtemca rmaps_base_verbose 5?

rhc54 avatar Aug 25 '22 22:08 rhc54

@rhc54 If I build ompi from the GitHub main source (with updated submodules), there is no such issue. But if I use the latest nightly main tarball https://download.open-mpi.org/nightly/open-mpi/main/openmpi-main-202208250241-96fadd9.tar.bz2, I can reproduce this issue.

I build the GitHub ompi main source as:

git clone https://github.com/open-mpi/ompi ompi-main
cd ompi-main
git submodule update --recursive --init
./configure CFLAGS=-pipe --enable-picky --enable-debug --without-verbs --with-ofi=/opt/amazon/efa/ --enable-mpi1-compatibility --prefix=/home/ec2-user/ompi-main/install --disable-man-pages
make -j install

I build the nightly tarball as:

./configure CFLAGS=-pipe --enable-picky --enable-debug --without-verbs --with-ofi=/opt/amazon/efa/ --enable-mpi1-compatibility --prefix=/home/ec2-user/openmpi-main-202208250241-96fadd9/install

I am not sure if there is a difference in the pmix/prrte between the latest nightly tarball and the GitHub main. A naive diff shows a lot of differences...

I will try copying the pmix/prrte sources into the nightly tarball tree to see if that fixes the issue.

shijin-aws avatar Aug 26 '22 19:08 shijin-aws

You'll probably need to re-run autogen.pl once you copy them over, since pmix/prrte will be coming from a git repo and not a tarball, but it should otherwise be okay. Afraid I don't know how old the pmix/prrte code is in the nightly tarball - it could be fairly old.

rhc54 avatar Aug 26 '22 19:08 rhc54
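
A sketch of that copy-and-regenerate workflow, assuming the git checkout and the unpacked tarball sit side by side (directory names match the earlier comments, but the layout is an assumption):

# replace the tarball's bundled pmix/prrte with the git versions
rm -rf openmpi-main-202208250241-96fadd9/3rd-party/openpmix openmpi-main-202208250241-96fadd9/3rd-party/prrte
cp -r ompi-main/3rd-party/openpmix ompi-main/3rd-party/prrte openmpi-main-202208250241-96fadd9/3rd-party/
cd openmpi-main-202208250241-96fadd9 && ./autogen.pl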

After copying and running ./autogen.pl, I hit this error during configure:

============================================================================
== Configure PMIx
============================================================================
checking --with-pmix value... not found
configure: WARNING: Expected file /usr/include/pmix.h not found
configure: error: Cannot continue
configure: ===== done with 3rd-party/prrte configure =====
configure: error: PRRTE configuration failed.  Cannot continue.

shijin-aws avatar Aug 26 '22 20:08 shijin-aws
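
One hedged guess at a workaround (not verified here): explicitly tell configure to use the bundled components so it does not go looking for a system-wide pmix.h:

./configure --with-pmix=internal --with-prrte=internal CFLAGS=-pipe --enable-picky --enable-debug --without-verbs --with-ofi=/opt/amazon/efa/ --enable-mpi1-compatibility --prefix=/home/ec2-user/openmpi-main-202208250241-96fadd9/install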

The latest GitHub main commit is from Aug 24th; I thought the nightly main tarball on Aug 25 would be the same as the main branch. But I do not know how the tarball is generated.

shijin-aws avatar Aug 26 '22 20:08 shijin-aws

How about this: do a git clone of OMPI, then do the submodule init. Go into 3rd-party/openpmix and 3rd-party/prrte, and in each one do git checkout master; git pull.

Then just build OMPI as usual.

rhc54 avatar Aug 26 '22 20:08 rhc54
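
Spelled out as commands, that recipe looks roughly like this (configure options trimmed to a prefix for brevity; add the usual flags as needed):

git clone https://github.com/open-mpi/ompi ompi-main
cd ompi-main
git submodule update --init --recursive
(cd 3rd-party/openpmix && git checkout master && git pull)
(cd 3rd-party/prrte && git checkout master && git pull)
./autogen.pl
./configure --prefix=$HOME/ompi-main/install
make -j install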

Oh, sorry - that's what you already did and it worked fine, yes? If so, then aren't we done? It's the tarball that is having the problem.

rhc54 avatar Aug 26 '22 20:08 rhc54

> Oh, sorry - that's what you already did and it worked fine, yes? If so, then aren't we done? It's the tarball that is having the problem.

Yes. I didn't even bother checking out prrte and openpmix to master; I just used whatever is bumped into ompi main.

shijin-aws avatar Aug 26 '22 20:08 shijin-aws

Okay - the commit history indicates that the PMIx/PRRTE submodule pointers were last updated on Aug 24. I'm guessing your nightly tarball was from before that date? If so, that would explain the difference.

rhc54 avatar Aug 26 '22 21:08 rhc54
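
A quick way to check when the submodule pointers were last bumped in a git checkout (standard git log filtered to the submodule paths):

git log -1 --format='%ad %h %s' -- 3rd-party/openpmix 3rd-party/prrte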

This nightly tarball https://download.open-mpi.org/nightly/open-mpi/main/openmpi-main-202208250241-96fadd9.tar.bz2 indicates it was generated on Aug 25, but I can still hit the issue with it.

shijin-aws avatar Aug 26 '22 21:08 shijin-aws
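
One way to check what a given tarball actually bundles, independent of its date stamp, assuming the nightly tarball ships the expanded submodule trees under 3rd-party (as the configure output above suggests); PMIx and PRRTE each record their release info in a top-level VERSION file:

tar xjf openmpi-main-202208250241-96fadd9.tar.bz2
head openmpi-main-202208250241-96fadd9/3rd-party/openpmix/VERSION
head openmpi-main-202208250241-96fadd9/3rd-party/prrte/VERSION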