
MPI_Comm_split_type does not behave as expected

Open · jkarns275 opened this issue 3 years ago · 16 comments

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

pacman

Please describe the system on which you are running

  • Operating system/version: Arch Linux, Kernel 5.17.9
  • Computer hardware: EPYC 7551p, 64 GB ram,
  • Network type: Ethernet

Details of the problem

I would expect the following program to create 8 separate communicators, each containing 8 processes, when run with 64 MPI processes on a 7551P (which has 8 L3 cache banks):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    MPI_Comm l3_comm;
    int l3_rank, l3_size;
    int result = MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, world_rank, MPI_INFO_NULL, &l3_comm);
    int r0 = MPI_Comm_rank(l3_comm, &l3_rank);
    int r1 = MPI_Comm_size(l3_comm, &l3_size);

    if (world_rank == 0) {
        printf("%d %d %d\n", result, r0, r1);
    }
    printf("GLOBAL %2.2d / %2.2d L3 %2.2d / %2.2d\n", world_rank, world_size, l3_rank, l3_size);

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Comm_free(&l3_comm);
    MPI_Finalize();
}

Instead, it creates 64 separate communicators that each contain 1 process. This program actually outputs the following (I trimmed this down a bit for brevity):

GLOBAL 63 / 64 L3 00 / 01
GLOBAL 02 / 64 L3 00 / 01
GLOBAL 26 / 64 L3 00 / 01
GLOBAL 27 / 64 L3 00 / 01
GLOBAL 28 / 64 L3 00 / 01
GLOBAL 29 / 64 L3 00 / 01
GLOBAL 31 / 64 L3 00 / 01
GLOBAL 32 / 64 L3 00 / 01
GLOBAL 34 / 64 L3 00 / 01
GLOBAL 35 / 64 L3 00 / 01
0 0 0
GLOBAL 37 / 64 L3 00 / 01
GLOBAL 38 / 64 L3 00 / 01
GLOBAL 39 / 64 L3 00 / 01
GLOBAL 40 / 64 L3 00 / 01
GLOBAL 41 / 64 L3 00 / 01
GLOBAL 43 / 64 L3 00 / 01
GLOBAL 44 / 64 L3 00 / 01
GLOBAL 48 / 64 L3 00 / 01
GLOBAL 49 / 64 L3 00 / 01
GLOBAL 50 / 64 L3 00 / 01
GLOBAL 51 / 64 L3 00 / 01
GLOBAL 52 / 64 L3 00 / 01
GLOBAL 53 / 64 L3 00 / 01
GLOBAL 54 / 64 L3 00 / 01
GLOBAL 55 / 64 L3 00 / 01
GLOBAL 56 / 64 L3 00 / 01
GLOBAL 57 / 64 L3 00 / 01
(continues for all 64 processes)

The program also gives identical results for other split types (e.g. OMPI_COMM_TYPE_NUMA, OMPI_COMM_TYPE_CORE).

jkarns275 avatar May 21 '22 20:05 jkarns275

I am unable to replicate the issue -- your program seems to work fine for me on an older 16-core x86 machine (with 2 NUMA nodes, each one with a single shared L3 cache):

$ mpicc split-type.c -g -O0 -o split-type && mpirun -np 16 ./split-type
0 0 0
GLOBAL 01 / 16 L3 00 / 08
GLOBAL 02 / 16 L3 01 / 08
GLOBAL 03 / 16 L3 01 / 08
...
etc.

Can you send the hwloc lstopo output from your machine?

Also, I doubt that we have changed much in this area, but there are some 2nd/3rd-level effects possible from other dependencies: can you try upgrading to the latest 4.1.4rc from https://www.open-mpi.org/software/ompi/v4.1/ to see if the output is different?

jsquyres avatar May 23 '22 20:05 jsquyres

I can replicate it by just changing the type (L2 instead of L3). It seems the root cause is an incorrect locality in the proc info (proc_flags) in ompi_comm_split_type_get_part. I haven't yet had time to look at how the locality info was obtained.

bosilca avatar May 23 '22 22:05 bosilca

lstopo output:

Machine (63GB total) + Package L#0
  Die L#0
    L3 L#0 (8192KB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#32)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#33)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#34)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#35)
    L3 L#1 (8192KB)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#36)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#37)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#38)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#39)
    HostBridge
      PCIBridge
        PCIBridge
          PCI 02:00.0 (VGA)
      PCIBridge
        PCI 04:00.2 (SATA)
  Die L#1
    NUMANode L#0 (P#1 31GB)
    L3 L#2 (8192KB)
      L2 L#8 (512KB) + L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#40)
      L2 L#9 (512KB) + L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#41)
      L2 L#10 (512KB) + L1d L#10 (32KB) + L1i L#10 (64KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#42)
      L2 L#11 (512KB) + L1d L#11 (32KB) + L1i L#11 (64KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#43)
    L3 L#3 (8192KB)
      L2 L#12 (512KB) + L1d L#12 (32KB) + L1i L#12 (64KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#44)
      L2 L#13 (512KB) + L1d L#13 (32KB) + L1i L#13 (64KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#45)
      L2 L#14 (512KB) + L1d L#14 (32KB) + L1i L#14 (64KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#46)
      L2 L#15 (512KB) + L1d L#15 (32KB) + L1i L#15 (64KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#47)
  Die L#2
    L3 L#4 (8192KB)
      L2 L#16 (512KB) + L1d L#16 (32KB) + L1i L#16 (64KB) + Core L#16
        PU L#32 (P#16)
        PU L#33 (P#48)
      L2 L#17 (512KB) + L1d L#17 (32KB) + L1i L#17 (64KB) + Core L#17
        PU L#34 (P#17)
        PU L#35 (P#49)
      L2 L#18 (512KB) + L1d L#18 (32KB) + L1i L#18 (64KB) + Core L#18
        PU L#36 (P#18)
        PU L#37 (P#50)
      L2 L#19 (512KB) + L1d L#19 (32KB) + L1i L#19 (64KB) + Core L#19
        PU L#38 (P#19)
        PU L#39 (P#51)
    L3 L#5 (8192KB)
      L2 L#20 (512KB) + L1d L#20 (32KB) + L1i L#20 (64KB) + Core L#20
        PU L#40 (P#20)
        PU L#41 (P#52)
      L2 L#21 (512KB) + L1d L#21 (32KB) + L1i L#21 (64KB) + Core L#21
        PU L#42 (P#21)
        PU L#43 (P#53)
      L2 L#22 (512KB) + L1d L#22 (32KB) + L1i L#22 (64KB) + Core L#22
        PU L#44 (P#22)
        PU L#45 (P#54)
      L2 L#23 (512KB) + L1d L#23 (32KB) + L1i L#23 (64KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#55)
    HostBridge
      PCIBridge
        PCI 41:00.0 (NVMExp)
          Block(Disk) "nvme0n1"
      PCIBridge
        PCI 42:00.0 (NVMExp)
          Block(Disk) "nvme1n1"
      PCIBridge
        PCI 44:00.2 (SATA)
          Block(Disk) "sdd"
          Block(Disk) "sdb"
          Block(Disk) "sdc"
          Block(Disk) "sda"
  Die L#3
    NUMANode L#1 (P#3 31GB)
    L3 L#6 (8192KB)
      L2 L#24 (512KB) + L1d L#24 (32KB) + L1i L#24 (64KB) + Core L#24
        PU L#48 (P#24)
        PU L#49 (P#56)
      L2 L#25 (512KB) + L1d L#25 (32KB) + L1i L#25 (64KB) + Core L#25
        PU L#50 (P#25)
        PU L#51 (P#57)
      L2 L#26 (512KB) + L1d L#26 (32KB) + L1i L#26 (64KB) + Core L#26
        PU L#52 (P#26)
        PU L#53 (P#58)
      L2 L#27 (512KB) + L1d L#27 (32KB) + L1i L#27 (64KB) + Core L#27
        PU L#54 (P#27)
        PU L#55 (P#59)
    L3 L#7 (8192KB)
      L2 L#28 (512KB) + L1d L#28 (32KB) + L1i L#28 (64KB) + Core L#28
        PU L#56 (P#28)
        PU L#57 (P#60)
      L2 L#29 (512KB) + L1d L#29 (32KB) + L1i L#29 (64KB) + Core L#29
        PU L#58 (P#29)
        PU L#59 (P#61)
      L2 L#30 (512KB) + L1d L#30 (32KB) + L1i L#30 (64KB) + Core L#30
        PU L#60 (P#30)
        PU L#61 (P#62)
      L2 L#31 (512KB) + L1d L#31 (32KB) + L1i L#31 (64KB) + Core L#31
        PU L#62 (P#31)
        PU L#63 (P#63)
    HostBridge
      PCIBridge
        PCI 62:00.0 (Ethernet)
          Net "enp98s0f0"
        PCI 62:00.1 (Ethernet)
          Net "enp98s0f1"

I will try 4.1.4rc now.

jkarns275 avatar May 23 '22 22:05 jkarns275

I compiled 4.1.4rc and still get the same output. It may be worth mentioning that my NUMA setup is sort of messed up, since I only have two sticks of RAM but my CPU has four NUMA nodes, so I believe NUMA is not actually being used. That could possibly be messing with things? I have some sticks of RAM on the way, so we will see if that's the case.

numactl output:

$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 32085 MB
node 1 free: 23843 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 32192 MB
node 3 free: 25938 MB
node distances:
node   0   1   2   3 
  0:  10  16  16  16 
  1:  16  10  16  16 
  2:  16  16  10  16 
  3:  16  16  16  10 
$ numactl -s
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 
cpubind: 1 3 
nodebind: 1 3 
membind: 1 3

Is there anything else I could try in the meantime?

jkarns275 avatar May 23 '22 23:05 jkarns275

@jkarns275 I talked about this today with @bosilca. He's able to reproduce a variant of this issue; I'm not. For some reason, his flags are getting set incorrectly (similar to how yours are apparently getting set incorrectly). Mine are apparently getting set correctly. ☹️ He's going to dig into this more later today (i.e., to see how the flags are getting set incorrectly).

FWIW: Your RAM situation should not really affect the issue; lstopo showed the correct information for your system (i.e., correct grouping by L3 cache, which shouldn't be affected by RAM anyway). Just for another data point, can you also run with mpirun --report-bindings to ensure that your processes are bound where we think they should be bound?

jsquyres avatar May 24 '22 15:05 jsquyres

Let me take back my earlier comment about my flags not being set correctly. I was testing on macOS, where it is impossible to bind processes to specific resources. Thus, PRRTE correctly reported all the information it could, i.e., that my processes were located on the same node. Let me move to a Linux box to see what I get there.

bosilca avatar May 24 '22 21:05 bosilca

All good on Linux, works as expected. Here is how to run it to get as much info as possible from the runtime: mpirun -x HWLOC_DEBUG_VERBOSE=0 -np 8 --report-bindings ./comm_split.

If you want to see what HWLOC finds on your node, change HWLOC_DEBUG_VERBOSE to 1 on the command line above. You should get something very similar to the output of hwloc-ls.
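
For example, the same command as above with the variable flipped to 1:

mpirun -x HWLOC_DEBUG_VERBOSE=1 -np 8 --report-bindings ./comm_split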

bosilca avatar May 24 '22 21:05 bosilca

I noticed that if I run with 16 or fewer processes, it appears to work: mpirun -x HWLOC_DEBUG_VERBOSE=1 -np 16 --report-bindings ./a.out 2> out yields:

[server:369121] MCW rank 12 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:369121] MCW rank 13 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:369121] MCW rank 14 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:369121] MCW rank 15 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:369121] MCW rank 0 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:369121] MCW rank 1 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:369121] MCW rank 2 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:369121] MCW rank 3 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:369121] MCW rank 4 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:369121] MCW rank 5 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:369121] MCW rank 6 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:369121] MCW rank 7 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:369121] MCW rank 8 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:369121] MCW rank 9 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:369121] MCW rank 10 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:369121] MCW rank 11 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
GLOBAL 02 / 16 L3 01 / 08
GLOBAL 03 / 16 L3 01 / 08
GLOBAL 04 / 16 L3 02 / 08
GLOBAL 05 / 16 L3 02 / 08
GLOBAL 06 / 16 L3 03 / 08
GLOBAL 07 / 16 L3 03 / 08
GLOBAL 08 / 16 L3 04 / 08
GLOBAL 09 / 16 L3 04 / 08
GLOBAL 10 / 16 L3 05 / 08
GLOBAL 11 / 16 L3 05 / 08
GLOBAL 12 / 16 L3 06 / 08
GLOBAL 13 / 16 L3 06 / 08
GLOBAL 14 / 16 L3 07 / 08
GLOBAL 15 / 16 L3 07 / 08
0 0 0
GLOBAL 00 / 16 L3 00 / 08
GLOBAL 01 / 16 L3 00 / 08

With any more than 16 processes, we get what I showed previously (groups of size one), but those extra details seem to have revealed something, although I'm not entirely sure what it means: mpirun -x HWLOC_DEBUG_VERBOSE=1 -np 17 --report-bindings ./a.out > out 2>&1

[server:369412] MCW rank 7 is not bound (or bound to all available processors)
[server:369413] MCW rank 8 is not bound (or bound to all available processors)
[server:369407] MCW rank 3 is not bound (or bound to all available processors)
[server:369423] MCW rank 12 is not bound (or bound to all available processors)
[server:369421] MCW rank 11 is not bound (or bound to all available processors)
[server:369408] MCW rank 4 is not bound (or bound to all available processors)
[server:369435] MCW rank 15 is not bound (or bound to all available processors)
[server:369416] MCW rank 9 is not bound (or bound to all available processors)
[server:369406] MCW rank 2 is not bound (or bound to all available processors)
[server:369404] MCW rank 0 is not bound (or bound to all available processors)
[server:369409] MCW rank 5 is not bound (or bound to all available processors)
[server:369405] MCW rank 1 is not bound (or bound to all available processors)
[server:369418] MCW rank 10 is not bound (or bound to all available processors)
[server:369410] MCW rank 6 is not bound (or bound to all available processors)
[server:369432] MCW rank 14 is not bound (or bound to all available processors)
[server:369439] MCW rank 16 is not bound (or bound to all available processors)
[server:369428] MCW rank 13 is not bound (or bound to all available processors)
GLOBAL 11 / 17 L3 00 / 01
GLOBAL 12 / 17 L3 00 / 01
GLOBAL 13 / 17 L3 00 / 01
GLOBAL 14 / 17 L3 00 / 01
GLOBAL 15 / 17 L3 00 / 01
GLOBAL 16 / 17 L3 00 / 01
0 0 0
GLOBAL 00 / 17 L3 00 / 01
GLOBAL 01 / 17 L3 00 / 01
GLOBAL 02 / 17 L3 00 / 01
GLOBAL 03 / 17 L3 00 / 01
GLOBAL 04 / 17 L3 00 / 01
GLOBAL 05 / 17 L3 00 / 01
GLOBAL 06 / 17 L3 00 / 01
GLOBAL 07 / 17 L3 00 / 01
GLOBAL 08 / 17 L3 00 / 01
GLOBAL 09 / 17 L3 00 / 01
GLOBAL 10 / 17 L3 00 / 01

I wonder if this could be related to the NUMA domains? A NUMA node on my system has 16 CPUs (as the numactl output above shows).

(The --report-bindings output for the 16-process run is identical to the output shown above.)

jkarns275 avatar May 24 '22 21:05 jkarns275

I will not pretend I understand why, but according to your output, as soon as you start more than 16 processes (on your 32-core node) your processes are no longer bound to resources. Therefore, floating processes cannot be split by any type other than node.
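
(As a quick sanity check -- just a suggestion, not something verified on this machine -- one could force an explicit binding policy and see whether the split then behaves:

mpirun --bind-to core -np 32 --report-bindings ./a.out

If the processes come back bound, the split types have locality information to work with.)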

I would reach out to the PRRTE folks to get more help.

bosilca avatar May 24 '22 21:05 bosilca

@bosilca He's using OMPI v4.1.x -- PRTE isn't relevant.

jsquyres avatar May 24 '22 21:05 jsquyres

Is there a more bleeding-edge build of Open MPI I should try? Perhaps one that actually uses this so-called PRRTE might avoid the issue.

jkarns275 avatar May 24 '22 22:05 jkarns275

Up to Open MPI v4.x, the underlying runtime system was named ORTE and was included in Open MPI. For the upcoming Open MPI v5.0.0, ORTE has been split off into a separate project: PRTE. Hence, PRTE is (sort of) an evolution of ORTE. But it's not a straight evolution - it contains a significant number of changes and differences compared to its ancestor ORTE.

You can try the latest Open MPI v5.0.0rc, but it isn't fully stable yet: https://www.open-mpi.org/software/ompi/v5.0/. It would be an interesting data point to see if the same behavior persists. But it may be (effectively) unrelated to the ORTE behavior in Open MPI v4.1.x.

Bottom line: if there's a bug in Open MPI v4.1's ORTE with regards to binding, we may have to fix that separate from PRTE.

jsquyres avatar May 24 '22 22:05 jsquyres

I've received those additional sticks of RAM. If for some reason that resolves the issue, we'll have a good idea of what may be causing the bug?

I'll try v5.0.0rc before and after this to see if it makes a difference.

jkarns275 avatar May 25 '22 18:05 jkarns275

Oddly enough, v5.0.0rc appears to work properly with only two sticks of RAM (run with 32 processes here):

[server:1555791] MCW rank 0 bound to package[0][core:0-7]
[server:1555791] MCW rank 2 bound to package[0][core:0-7]
[server:1555791] MCW rank 1 bound to package[0][core:0-7]
[server:1555791] MCW rank 3 bound to package[0][core:0-7]
[server:1555791] MCW rank 7 bound to package[0][core:0-7]
[server:1555791] MCW rank 6 bound to package[0][core:0-7]
[server:1555791] MCW rank 4 bound to package[0][core:0-7]
[server:1555791] MCW rank 5 bound to package[0][core:0-7]
[server:1555791] MCW rank 11 bound to package[0][core:8-15]
[server:1555791] MCW rank 10 bound to package[0][core:8-15]
[server:1555791] MCW rank 8 bound to package[0][core:8-15]
[server:1555791] MCW rank 9 bound to package[0][core:8-15]
[server:1555791] MCW rank 15 bound to package[0][core:8-15]
[server:1555791] MCW rank 14 bound to package[0][core:8-15]
[server:1555791] MCW rank 12 bound to package[0][core:8-15]
[server:1555791] MCW rank 19 bound to package[0][core:16-23]
[server:1555791] MCW rank 13 bound to package[0][core:8-15]
[server:1555791] MCW rank 23 bound to package[0][core:16-23]
[server:1555791] MCW rank 16 bound to package[0][core:16-23]
[server:1555791] MCW rank 18 bound to package[0][core:16-23]
[server:1555791] MCW rank 17 bound to package[0][core:16-23]
[server:1555791] MCW rank 27 bound to package[0][core:24-31]
[server:1555791] MCW rank 22 bound to package[0][core:16-23]
[server:1555791] MCW rank 31 bound to package[0][core:24-31]
[server:1555791] MCW rank 20 bound to package[0][core:16-23]
[server:1555791] MCW rank 26 bound to package[0][core:24-31]
[server:1555791] MCW rank 30 bound to package[0][core:24-31]
[server:1555791] MCW rank 24 bound to package[0][core:24-31]
[server:1555791] MCW rank 25 bound to package[0][core:24-31]
[server:1555791] MCW rank 21 bound to package[0][core:16-23]
[server:1555791] MCW rank 28 bound to package[0][core:24-31]
[server:1555791] MCW rank 29 bound to package[0][core:24-31]
GLOBAL 03 / 32 L3 03 / 08
GLOBAL 01 / 32 L3 01 / 08
GLOBAL 11 / 32 L3 03 / 08
GLOBAL 12 / 32 L3 04 / 08
GLOBAL 02 / 32 L3 02 / 08
GLOBAL 06 / 32 L3 06 / 08
GLOBAL 15 / 32 L3 07 / 08
GLOBAL 27 / 32 L3 03 / 08
GLOBAL 13 / 32 L3 05 / 08
GLOBAL 07 / 32 L3 07 / 08
GLOBAL 24 / 32 L3 00 / 08
GLOBAL 09 / 32 L3 01 / 08
GLOBAL 29 / 32 L3 05 / 08
GLOBAL 30 / 32 L3 06 / 08
GLOBAL 26 / 32 L3 02 / 08
GLOBAL 23 / 32 L3 07 / 08
GLOBAL 21 / 32 L3 05 / 08
GLOBAL 19 / 32 L3 03 / 08
GLOBAL 10 / 32 L3 02 / 08
GLOBAL 18 / 32 L3 02 / 08
GLOBAL 05 / 32 L3 05 / 08
GLOBAL 28 / 32 L3 04 / 08
GLOBAL 20 / 32 L3 04 / 08
GLOBAL 14 / 32 L3 06 / 08
GLOBAL 25 / 32 L3 01 / 08
GLOBAL 31 / 32 L3 07 / 08
GLOBAL 22 / 32 L3 06 / 08
0 0 0
GLOBAL 00 / 32 L3 00 / 08
GLOBAL 17 / 32 L3 01 / 08
GLOBAL 16 / 32 L3 00 / 08
GLOBAL 04 / 32 L3 04 / 08
GLOBAL 08 / 32 L3 00 / 08

Version 5.0.0rc itself has some issues though. For example, the --use-hwthread-cpus option causes the program to exit without outputting anything: mpirun -x HWLOC_DEBUG_VERBOSE=0 --use-hwthread-cpus -np 16 --report-binding ./a.out

The fact that 4.1 fails while 5.0.0rc succeeds makes me think it is in fact a bug with ORTE. How should we proceed from here?

jkarns275 avatar May 25 '22 21:05 jkarns275

So, oddly enough, now that each of my NUMA nodes actually has RAM, the split appears to work as expected when I use Open MPI 4.1.4rc:

mpirun -x HWLOC_DEBUG_VERBOSE=0 -np 17 --report-bindings ./a.out

[server:247961] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../../../../../../../../../..]
[server:247961] MCW rank 1 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:247961] MCW rank 2 bound to socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]]: [../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../..]
[server:247961] MCW rank 3 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:247961] MCW rank 4 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../../../../../../../../../..]
[server:247961] MCW rank 5 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:247961] MCW rank 6 bound to socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]]: [../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../..]
[server:247961] MCW rank 7 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:247961] MCW rank 8 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../../../../../../../../../..]
[server:247961] MCW rank 9 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:247961] MCW rank 10 bound to socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]]: [../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../..]
[server:247961] MCW rank 11 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:247961] MCW rank 12 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../../../../../../../../../..]
[server:247961] MCW rank 13 bound to socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]]: [../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../..]
[server:247961] MCW rank 14 bound to socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]]: [../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../..]
[server:247961] MCW rank 15 bound to socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../../BB/BB/BB/BB/BB/BB/BB/BB]
[server:247961] MCW rank 16 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB/../../../../../../../../../../../../../../../../../../../../../../../..]
GLOBAL 09 / 17 L3 02 / 04
GLOBAL 10 / 17 L3 02 / 04
GLOBAL 11 / 17 L3 02 / 04
GLOBAL 12 / 17 L3 03 / 05
GLOBAL 13 / 17 L3 03 / 04
GLOBAL 14 / 17 L3 03 / 04
GLOBAL 15 / 17 L3 03 / 04
GLOBAL 16 / 17 L3 04 / 05
GLOBAL 01 / 17 L3 00 / 04
GLOBAL 02 / 17 L3 00 / 04
GLOBAL 03 / 17 L3 00 / 04
GLOBAL 04 / 17 L3 01 / 05
GLOBAL 05 / 17 L3 01 / 04
GLOBAL 06 / 17 L3 01 / 04
GLOBAL 07 / 17 L3 01 / 04
GLOBAL 08 / 17 L3 02 / 05
0 0 0
GLOBAL 00 / 17 L3 00 / 05

If necessary, I can remove the new sticks of RAM to recreate the bug. I don't know the internal workings of MPI well enough to say what may be going wrong, but it appears to be related to the detection of NUMA nodes.

jkarns275 avatar May 26 '22 00:05 jkarns275

Any thoughts on this? The bug still does exist, although the system configuration required to cause it is somewhat uncommon.

jkarns275 avatar Jun 02 '22 21:06 jkarns275

I'm having a similar issue with MPI_Comm_split_type on Arch Linux. I'm not sure which package or dependency is causing it, because I cannot reproduce it on other distributions.

To easily reproduce it with Docker:

Dockerfile:

FROM archlinux

RUN pacman -Syu --noconfirm \
    base-devel \
    openmpi

COPY t1.c /
RUN mpicc t1.c && mpirun -n 2 --allow-run-as-root --oversubscribe ./a.out

t1.c:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Comm mpi_comm_node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
            MPI_INFO_NULL, &mpi_comm_node);
    int node_rank, node_size;
    MPI_Comm_rank(mpi_comm_node, &node_rank);
    MPI_Comm_size(mpi_comm_node, &node_size);
    printf("node_rank: %d, node_size: %d\n", node_rank, node_size);
    MPI_Finalize();
    return 0;
}

Build with docker build . (or docker buildx build . --progress=plain if using buildx).

output:

node_rank: 0, node_size: 1
node_rank: 0, node_size: 1

It should print node_rank: 0 and 1, and node_size: 2.
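
In other words, the expected output would be (in either order):

node_rank: 0, node_size: 2
node_rank: 1, node_size: 2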

I've also tried compiling OpenMPI 4.1.4 and 4.1.5 from source, and I can reproduce the bug as long as I use the configure flag --with-pmix=external. If I don't use the flag, the result is correct. I have also tried OpenMPI 5.0.0 rc10, and the bug is not reproducible regardless of the configuration.
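
For reference, the two source builds I compared look roughly like this (installation prefix and other options omitted):

./configure --with-pmix=external   # bug reproduces
./configure                        # bundled (internal) PMIx, works correctly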

The current versions in Arch Linux are OpenMPI 4.1.5 and OpenPMIx 4.2.3, but as I said, I cannot reproduce it on other distributions even if I use the same OpenPMIx version.

Apologies if this is not related to the same issue; if it's not, I can create another one.

vlopezh avatar Mar 06 '23 09:03 vlopezh

I was able to reproduce the issue on a RHEL7 box with external PMIx 4.2.3.

On Arch Linux, I can reproduce the bug if I use a manually recompiled PMIx 4.2.3, but not with older versions (for example, 4.2.2 works just fine).

ggouaillardet avatar Mar 07 '23 01:03 ggouaillardet

@rhc54 could you please shed some light?

the locality (e.g. the pmix.loc key) is set by ess/pmi:

            kv = OBJ_NEW(opal_value_t);
            kv->key = strdup(OPAL_PMIX_LOCALITY);
            kv->type = OPAL_UINT16;
            OPAL_OUTPUT_VERBOSE((1, orte_ess_base_framework.framework_output,
                                 "%s ess:pmi:locality: proc %s locality %s",
                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                                 ORTE_NAME_PRINT(&pname), opal_hwloc_base_print_locality(u16)));
            kv->data.uint16 = u16;
            ret = opal_pmix.store_local(&pname, kv);

with PMIx 4.2.2, it is correctly retrieved in ompi/proc/proc.c:377

            /* get the locality information - all RTEs are required
             * to provide this information at startup */
            OPAL_MODEX_RECV_VALUE_OPTIONAL(ret, OPAL_PMIX_LOCALITY, &proc->super.proc_name, &u16ptr, OPAL_UINT16);

but with PMIx 4.2.3 this returns PMIX_ERR_PROC_ENTRY_NOT_FOUND, so the locality is never set and all the MPI tasks are treated by MPI_Comm_split_type() as running on different nodes.

Is this a PMIx bug? Or is Open MPI doing it wrong and we have just been lucky that it worked so far?

ggouaillardet avatar Mar 07 '23 02:03 ggouaillardet

If it works in PMIx 4.2.2 and not with PMIx 4.2.3, then it would sound like a bug - but it seems an odd one if all other retrievals are working correctly. I honestly don't remember much about OMPI v4.x, but you do have that OPAL integration layer in-between, and it could be that something problematic is happening in there. Recall that OMPI v4 integration is to PMIx v3.x, not v4.x, so some peculiarities may be imposed - though this is a pretty trivial use-case.

I just gave it a try with PMIx v4.2.3 using a simplified PMIx-only example:

    value.type = PMIX_UINT16;
    value.data.uint16 = 123;
    PMIX_LOAD_PROCID(&proc, myproc.nspace, 100);
    PMIx_Store_internal(&proc, PMIX_LOCALITY, &value);

    PMIX_INFO_LOAD(&boo, PMIX_OPTIONAL, NULL, PMIX_BOOL);
    rc = PMIx_Get(&proc, PMIX_LOCALITY, &boo, 1, &val);
    fprintf(stderr, "LOCALITY %s\n", PMIx_Error_string(rc));

and it worked fine:

LOCALITY SUCCESS

I'd suggest perhaps doing two things:

  • compile the above simple example using mpicc (the symbols should be visible with an external PMIx) and see if it behaves as expected. I just used the examples/client.c code in PMIx and added the above right after the call to PMIx_Init (and then exited); a self-contained sketch of this is shown below the list
  • create an OMPI example that does the same, only going through the opal layer
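
For reference, a self-contained version of the first suggestion could look roughly like this (an untested sketch assuming the standard PMIx client API; it needs to run under mpirun/prun so that PMIx_Init can reach a server, and must be compiled against the external PMIx headers and linked with -lpmix):

#include <stdio.h>
#include <pmix.h>

int main(void) {
    pmix_proc_t myproc, proc;
    pmix_value_t value, *val = NULL;
    pmix_info_t boo;
    pmix_status_t rc;

    /* connect to the local PMIx server */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* store a locality value for an arbitrary rank in our own namespace ... */
    value.type = PMIX_UINT16;
    value.data.uint16 = 123;
    PMIX_LOAD_PROCID(&proc, myproc.nspace, 100);
    PMIx_Store_internal(&proc, PMIX_LOCALITY, &value);

    /* ... then try to retrieve it, marking the request as optional */
    PMIX_INFO_LOAD(&boo, PMIX_OPTIONAL, NULL, PMIX_BOOL);
    rc = PMIx_Get(&proc, PMIX_LOCALITY, &boo, 1, &val);
    fprintf(stderr, "LOCALITY %s\n", PMIx_Error_string(rc));
    if (PMIX_SUCCESS == rc) {
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}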

Hope that helps

rhc54 avatar Mar 07 '23 03:03 rhc54

@rhc54 thanks!

I did a quick try but I messed up somewhere ...

Anyway, I did some differential debugging and found something that looks suspicious to me:

  • 4.2.2: PMIX_INFO_CREATE() is a macro => the flags of PMIX_OPTIONAL pmix_info_t ends up being set to PMIX_INFO_ARRAY_END and then the info is not considered as a qualifier per PMIX_INFO_IS_QUALIFIER()

  • 4.2.3: PMIX_INFO_CREATE() macro has been deprecated and PMIX_Info_create() is called. PMIX_OPTIONAL pmix_info_t flags is set to ~PMIX_INFO_PERSISTENT e.g. 0xffffffef and then the info is considered as a qualifier per PMIX_INFO_IS_QUALIFIER(). As a consequence, lookup_keyval() does execute the else part (line 577)

                /* if the stored key is also "unqualified",
                 * then return it */
                if (UINT32_MAX == d->qualindex) {
                    return d;

and the locality is not found.

ggouaillardet avatar Mar 07 '23 05:03 ggouaillardet

@rhc54 I think I see what is going on ...

we should use PMIX_INFO_LOAD(...) in pmix3x_value_load(...) instead of doing some memcpy().

I did the following hack in PMIx_Get(...) from ext3x_client.c

    if (NULL != info && 0 < (sz = opal_list_get_size(info))) {
        PMIX_INFO_CREATE(pinfo, sz);
        n=0;
        OPAL_LIST_FOREACH(ival, info, opal_value_t) {
            if (ival->type == OPAL_BOOL) {
                PMIX_INFO_LOAD(pinfo + n, ival->key, &ival->data, PMIX_BOOL);
            } else {
                (void)strncpy(pinfo[n].key, ival->key, PMIX_MAX_KEYLEN);
                ext3x_value_load(&pinfo[n].value, ival);
            }
            ++n;
        }
    }

and it was good enough to run this program correctly.

At first glance, a full fix should work with PMIx v3 and v4.

I noted the PMIX_INFO_LOAD(...) macro has been moved into pmix_deprecated.h; does that mean this is not the best way to do things?

ggouaillardet avatar Mar 07 '23 07:03 ggouaillardet

The standard converted the macros to functions so someone wanting to dlopen/dlsym the library could use them. I would stick with the macros myself - they aren't going anywhere as they are required for backward compatibility.

I wouldn't have the if/else clause in the loop - just use PMIX_INFO_LOAD for all the values.

It sounds like we do have a bug in 4.2.3 - PMIX_OPTIONAL has nothing to do with persistence. I'll take a look at it. Thanks for pointing it out!

rhc54 avatar Mar 07 '23 14:03 rhc54

PMIX_OPTIONAL pmix_info_t flags is set to ~PMIX_INFO_PERSISTENT

I believe that was a typo - it is set to ~PMIX_INFO_REQD, which should be correct. I'm still checking to see why that would be picked up as a qualifier

rhc54 avatar Mar 07 '23 14:03 rhc54

4.2.3: PMIX_INFO_CREATE() macro has been deprecated and PMIX_Info_create() is called. PMIX_OPTIONAL pmix_info_t flags is set to ~PMIX_INFO_PERSISTENT e.g. 0xffffffef and then the info is considered as a qualifier per PMIX_INFO_IS_QUALIFIER()

I can't find a problem here. I think you got confused when you looked at "persistent" instead of "reqd" on the info directive. With the typo corrected, it looks right to me.

rhc54 avatar Mar 07 '23 14:03 rhc54

I did find a bug when we initialize a pmix_info_t - we set it to ~PMIX_INFO_REQD, which turns on all the other flag directives and makes it look like a qualifier. Thought I had fixed that some time ago, but I guess not!

rhc54 avatar Mar 07 '23 15:03 rhc54

Aha - I found the problem! We aren't correctly testing the bits in the pmix_info_t flags - what we did was fine when we only had one bit set at a time, but now we are setting multiple bits. The methods we employed are the source of the problem you are seeing.

I'm fixing this now and it will be in v4.2.4. However, I think fixing the OMPI integration is also a good thing to do.
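
To make that concrete with made-up flag names (a generic sketch, not the actual PMIx definitions or code): an equality comparison on a flags field only behaves while exactly one bit is ever set, and initializing the field to the complement of a single bit turns on every other bit as a side effect:

#include <stdint.h>
#include <stdio.h>

/* hypothetical flag bits, not the real PMIx definitions */
#define FLAG_REQD      0x0001
#define FLAG_QUALIFIER 0x0002
#define FLAG_PERSIST   0x0004

int main(void) {
    /* multiple directive bits set on the same info */
    uint16_t flags = FLAG_REQD | FLAG_PERSIST;

    /* equality test: only worked while a single bit was ever set at a time */
    printf("equality says required: %s\n", (flags == FLAG_REQD) ? "yes" : "no");  /* no (wrong) */

    /* bit test: correct even with several bits set */
    printf("bit test says required: %s\n", (flags & FLAG_REQD) ? "yes" : "no");   /* yes */

    /* side effect of "clearing" one bit by assigning its complement: all other bits turn on */
    uint16_t bad = (uint16_t)~FLAG_REQD;
    printf("complement-init looks like a qualifier: %s\n",
           (bad & FLAG_QUALIFIER) ? "yes" : "no");                                /* yes (unintended) */
    return 0;
}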

rhc54 avatar Mar 07 '23 17:03 rhc54

What's the status of this issue? The current release on Arch Linux has been broken for several months now, the pull request #11472 seems to be stuck, and new releases of openmpi and openpmix that are supposed to fix this are still not out...

lahwaacz avatar Apr 18 '23 21:04 lahwaacz

Pretty sure all you have to do is back PMIx down to something earlier than 4.2.3.

rhc54 avatar Apr 19 '23 02:04 rhc54

@rhc54 Is that a recommendation for the whole distro? If so, why do you keep a stable release that should not be used? It is not a real solution.

lahwaacz avatar Apr 19 '23 08:04 lahwaacz