Problems when running examples hello_c

Open shiwch opened this issue 3 years ago • 14 comments

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from a source/distribution tarball

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: CentOS Linux release 7.6.1810 (AltArch) Linux version 4.14.0-115.el7a.0.1.aarch64 ([email protected])
  • Computer hardware:
[nscc-gz@centos203 examples]$ lscpu 
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    1
Core(s) per socket:    64
Socket(s):             2
NUMA node(s):          4
Model:                 0
BogoMIPS:              200.00
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
L3 cache:              65536K
NUMA node0 CPU(s):     0-31
NUMA node1 CPU(s):     32-63
NUMA node2 CPU(s):     64-95
NUMA node3 CPU(s):     96-127
Flags:                 fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop
  • Network type:

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Hi, when I run hello_c, I get the following output:

[nscc-gz@centos203 examples]$ mpirun -np 4  --mca orte_base_help_aggregate 0
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           centos203
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           centos203
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           centos203
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           centos203
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
Hello, world, I am 0 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 1 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 2 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 3 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)

Here is the ibstat output:

[nscc-gz@centos203 examples]$ ibstat
CA 'mlx5_0'
        CA type: MT4117
        Number of ports: 1
        Firmware version: 14.20.1820
        Hardware version: 0
        Node GUID: 
        System image GUID: 
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 25
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 
                Port GUID: 
                Link layer: Ethernet
CA 'mlx5_1'
        CA type: MT4117
        Number of ports: 1
        Firmware version: 14.20.1820
        Hardware version: 0
        Node GUID: 
        System image GUID: 
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 
                Port GUID: 
                Link layer: Ethernet

If I use this command instead, I get:

[nscc-gz@centos203 examples]$ mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm  -np 4  hello_c 
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           centos203
  Local device:         
  Local port:           1
  CPCs attempted:       rdmacm
--------------------------------------------------------------------------
Hello, world, I am 0 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 1 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 2 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 3 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
[centos203:10977] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[centos203:10977] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

And if I exclude the IB device mlx5_0:

[nscc-gz@centos203 examples]$ mpirun -np 4 ./hello_c --mca btl_openib_if_exclude mlx5_0
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: centos203
--------------------------------------------------------------------------
Hello, world, I am 0 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 1 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 2 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 3 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
[centos203:14896] 3 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[centos203:14896] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[nscc-gz@centos203 examples]$ 

The ibdev2netdev and ifconfig output:

[nscc-gz@centos203 examples]$ ibdev2netdev
mlx5_0 port 1 ==> enp1s0f0 (Up)
mlx5_1 port 1 ==> enp1s0f1 (Down)
[nscc-gz@centos203 examples]$ ifconfig
enp125s0f0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether   txqueuelen 1000  (Ethernet)
        RX packets 1951751285  bytes 2352472322729 (2.1 TiB)
        RX errors 0  dropped 11718888  overruns 0  frame 0
        TX packets 822856179  bytes 1385364963277 (1.2 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp125s0f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.29.130  netmask 255.255.255.0  broadcast 172.16.29.255
        inet6   prefixlen 64  scopeid 0x20<link>
        ether   txqueuelen 1000  (Ethernet)
        RX packets 19347918  bytes 7289410117 (6.7 GiB)
        RX errors 0  dropped 2958451  overruns 0  frame 0
        TX packets 12963627  bytes 48203399135 (44.8 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp125s0f2: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether   txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp125s0f3: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether   txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp1s0f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.40.1.203  netmask 255.255.255.0  broadcast 10.40.1.255
        inet6   prefixlen 64  scopeid 0x20<link>
        ether   txqueuelen 1000  (Ethernet)
        RX packets 382158355530  bytes 544487865896139 (495.2 TiB)
        RX errors 208  dropped 3083040  overruns 0  frame 208
        TX packets 379357423669  bytes 545429402206655 (496.0 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp1s0f1: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        ether  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 32048729965  bytes 809795856471103 (736.5 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 32048729965  bytes 809795856471103 (736.5 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[nscc-gz@centos203 examples]$ 

Could you tell me how I can use the IB devices correctly? Thanks!

shiwch avatar Nov 12 '22 13:11 shiwch

FYI @open-mpi/ucx team

jsquyres avatar Nov 14 '22 17:11 jsquyres

@shiwch Is running the ucx pml an option? If so, what happens if you run with -mca pml ucx?

janjust avatar Nov 14 '22 17:11 janjust

@janjust I get the same warning if I run with -mca pml ucx:

[nscc-gz@centos203 examples]$ mpirun  -np 4 -mca pml ucx -mca btl_openib_if_exclude mlx5_0  ./hello_c
--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: centos203
--------------------------------------------------------------------------
Hello, world, I am 1 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 2 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 0 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 3 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
[centos203:111249] 3 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[centos203:111249] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[nscc-gz@centos203 examples]$ 

shiwch avatar Nov 15 '22 11:11 shiwch

@shiwch OK, one more thing to try: -mca btl ^openib

janjust avatar Nov 15 '22 14:11 janjust
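For anyone who does not want to repeat the flag on every command line: Open MPI also reads MCA parameters from environment variables named OMPI_MCA_<parameter>. The sketch below sets the same btl value that way. The guard around mpirun and the ./hello_c path are assumptions added here so that the snippet is harmless on a machine without Open MPI.

```shell
# Same effect as passing "-mca btl ^openib" to mpirun: exclude the openib BTL.
# Single quotes keep the caret literal in every shell.
export OMPI_MCA_btl='^openib'

# Launch only if mpirun and the example binary are both present (assumed path).
if command -v mpirun >/dev/null 2>&1 && [ -x ./hello_c ]; then
    mpirun -np 4 ./hello_c
fi
```

The same setting can be made permanent with a line "btl = ^openib" in $HOME/.openmpi/mca-params.conf.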

@janjust Thanks! This command runs correctly. But I have another question: does the warning matter? I mean, is there a hidden performance penalty without openib support? Communication seems to work.

[nscc-gz@centos203 examples]$ mpirun -np 4 -mca btl ^openib -mca pml ucx -mca btl_openib_if_exclude mlx5_0  ./hello_c
Hello, world, I am 1 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 3 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 0 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 2 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
[nscc-gz@centos203 examples]$ 

shiwch avatar Nov 15 '22 14:11 shiwch

@shiwch How was Open MPI installed? What was the configure command?

I'm guessing that in your case, because openib cannot be selected, you're falling back to ucx.

janjust avatar Nov 15 '22 14:11 janjust

@janjust I installed openmpi with these commands.

wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.2.tar.gz
tar -xvf openmpi-4.1.2.tar.gz
cd openmpi-4.1.2 && mkdir build && cd build
../configure --prefix=/home/nscc-gz/shi/mpi/openmpi-4.1.2
make -j 16 && make install

shiwch avatar Nov 15 '22 15:11 shiwch
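Since the configure line above names no transport explicitly, it can help to check which components the build actually picked up. A minimal sketch, assuming ompi_info from the same installation is on PATH; the helper name filter_transports is made up for this example.

```shell
# Keep only the PML and BTL component lines from ompi_info-style output.
# (filter_transports is a hypothetical helper name, not an Open MPI tool.)
filter_transports() {
    grep -E 'MCA (pml|btl):'
}

# Run the real query only where ompi_info is installed.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | filter_transports
fi
```

If no "MCA pml: ucx" line appears, the build has no UCX support and cannot be using the ucx pml.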

@shiwch Which MOFED version do you have? Please run ofed_info -s

janjust avatar Nov 15 '22 15:11 janjust

@janjust

[nscc-gz@centos203 examples]$ ofed_info -s
MLNX_OFED_LINUX-4.7-3.2.9.0:
[nscc-gz@centos203 examples]$

shiwch avatar Nov 15 '22 15:11 shiwch

@shiwch One more command, please: show_gids

janjust avatar Nov 15 '22 15:11 janjust

@janjust

[nscc-gz@centos203 examples]$ show_gids
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_0  1       0       fe80:0000:0000:0000:526b:4bff:fe43:a96e                 v1      enp1s0f0
mlx5_0  1       1       fe80:0000:0000:0000:526b:4bff:fe43:a96e                 v2      enp1s0f0
mlx5_0  1       2       fe80:0000:0000:0000:882e:96dd:e5e7:0477                 v1      enp1s0f0
mlx5_0  1       3       fe80:0000:0000:0000:882e:96dd:e5e7:0477                 v2      enp1s0f0
mlx5_0  1       4       0000:0000:0000:0000:0000:ffff:0a28:01cb 10.40.1.203     v1      enp1s0f0
mlx5_0  1       5       0000:0000:0000:0000:0000:ffff:0a28:01cb 10.40.1.203     v2      enp1s0f0
mlx5_1  1       0       fe80:0000:0000:0000:526b:4bff:fe43:a96f                 v1      enp1s0f1
mlx5_1  1       1       fe80:0000:0000:0000:526b:4bff:fe43:a96f                 v2      enp1s0f1
n_gids_found=8
[nscc-gz@centos203 examples]$

shiwch avatar Nov 15 '22 15:11 shiwch
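As a reading aid for the table above: the GIDs at index 4 and 5 are IPv4-mapped addresses, so their last two groups are simply the interface address in hex. The sketch below decodes one; the function name gid_to_ipv4 is made up for this example and it assumes the fully expanded form that show_gids prints.

```shell
# Decode an IPv4-mapped GID (....:ffff:AABB:CCDD) into dotted-quad form.
# gid_to_ipv4 is a hypothetical helper; it expects the expanded GID format.
gid_to_ipv4() {
    gid=$1
    tail=${gid#*ffff:}   # last two groups, e.g. "0a28:01cb"
    hi=${tail%%:*}       # "0a28"
    lo=${tail##*:}       # "01cb"
    # Split each group into two bytes and print them as decimal.
    printf '%d.%d.%d.%d\n' \
        "0x${hi%??}" "0x${hi#??}" "0x${lo%??}" "0x${lo#??}"
}

gid_to_ipv4 0000:0000:0000:0000:0000:ffff:0a28:01cb   # prints 10.40.1.203
```

The result matches the IPv4 column for enp1s0f0. The INDEX column is what a GID index setting refers to; in the openib BTL that parameter appears to be btl_openib_gid_index, which is worth confirming with ompi_info --param btl openib --level 9 before relying on it.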

@shiwch A shot in the dark, but please try mpirun -np 4 -mca btl_openib_warn_default_gid_prefix 4 -mca btl_openib_if_include mlx5_0:1 ./hello_c. If 4 does not work, try GID 5. For some reason openib is not getting selected because it cannot find the correct device/port to use. Looks like a configuration issue to me.

But irrespective of that, I would also try running with verbose output, because I'm guessing ucx is selected by default, so it wouldn't matter if you disabled openib to get rid of the warning message.

janjust avatar Nov 15 '22 16:11 janjust
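One way to do the verbose run is to raise the verbosity of the PML and BTL frameworks, which makes Open MPI log the components it considers and selects. A rough sketch: the verbosity level 10 and the grep pattern are arbitrary choices made here, and the log wording varies between versions.

```shell
# Framework verbosity flags; level 10 is an arbitrary "fairly chatty" choice.
VERBOSE_FLAGS="--mca pml_base_verbose 10 --mca btl_base_verbose 10"

# Launch only if mpirun and the example binary are both present (assumed path).
if command -v mpirun >/dev/null 2>&1 && [ -x ./hello_c ]; then
    # VERBOSE_FLAGS is left unquoted on purpose so it splits into arguments.
    mpirun -np 4 $VERBOSE_FLAGS ./hello_c 2>&1 \
        | grep -iE 'select|ucx|openib' || true
fi
```

If the output shows the ucx pml being selected, the openib warning is only noise and excluding the openib BTL does not change the data path.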

@janjust Okay, thanks for your help! By the way, I tried GID 4 and 5, but got the same warning as before.

[nscc-gz@centos203 examples]$ mpirun -np 4 -mca btl_openib_warn_default_gid_prefix 4 -mca btl_openib_if_include mlx5_0:1 ./hello_c
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           centos203
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
Hello, world, I am 0 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 1 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 2 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 3 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
[centos203:02086] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[centos203:02086] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[nscc-gz@centos203 examples]$ mpirun -np 4 -mca btl_openib_warn_default_gid_prefix 5 -mca btl_openib_if_include mlx5_0:1 ./hello_c
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           centos203
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
Hello, world, I am 2 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 1 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 3 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
Hello, world, I am 0 of 4, (Open MPI v4.1.2, package: Open MPI nscc-gz@centos203 Distribution, ident: 4.1.2, repo rev: v4.1.2, Nov 24, 2021, 112)
[centos203:02141] 3 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[centos203:02141] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[nscc-gz@centos203 examples]$ 

shiwch avatar Nov 15 '22 17:11 shiwch

How do you solve this problem? I'm having the same problem.

heilengleng avatar Oct 22 '24 00:10 heilengleng