ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c

RahulKulhari opened this issue 6 years ago · 6 comments

Open MPI Version: v4.0.0

Output of ompi_info | head on both machines:

mpiuser@s2:~$ ssh s1 ompi_info | head
                 Package: Open MPI mpiuser@s1 Distribution
                Open MPI: 4.0.0
  Open MPI repo revision: v4.0.0
   Open MPI release date: Nov 12, 2018
                Open RTE: 4.0.0
  Open RTE repo revision: v4.0.0
   Open RTE release date: Nov 12, 2018
                    OPAL: 4.0.0
      OPAL repo revision: v4.0.0
       OPAL release date: Nov 12, 2018
mpiuser@s2:~$ ompi_info | head
                 Package: Open MPI mpiuser@s2 Distribution
                Open MPI: 4.0.0
  Open MPI repo revision: v4.0.0
   Open MPI release date: Nov 12, 2018
                Open RTE: 4.0.0
  Open RTE repo revision: v4.0.0
   Open RTE release date: Nov 12, 2018
                    OPAL: 4.0.0
      OPAL repo revision: v4.0.0
       OPAL release date: Nov 12, 2018

Both are installed on a common shared network filesystem.

When running the command locally on s1 (master):

mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -n 2 ./hello
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s1 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 112)
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s1 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 112)

When running the command locally on s2 (slave):

mpiuser@s2:~/cloud$ mpirun -n 2 ./hello
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s2 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 113)
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s2 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 113)

Check for hwloc packages on s2 (none installed):

mpiuser@s2:~/cloud/openmpi-4.0.0$ dpkg -l | grep hwloc
mpiuser@s2:~/cloud/openmpi-4.0.0$

Check for hwloc packages on s1 (none installed):

mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ dpkg -l | grep hwloc
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$

Both machines are running Ubuntu 16.04.5 LTS.

However, running the job across both hosts gives the following error:

mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -host s1,s2 ./hello
[s2:26283] [[40517,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[40517,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

RahulKulhari avatar Sep 26 '19 11:09 RahulKulhari

Are you sure you are running the very same hello program on both hosts?

You can double-check this by running:

mpirun -host s1,s2 --tag-output md5sum ./hello

ggouaillardet avatar Sep 26 '19 12:09 ggouaillardet
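
Since the remote orted is started through a non-interactive ssh shell, it is also worth confirming that s2 resolves the same Open MPI install when launched that way. A minimal sketch, assuming passwordless ssh from s1 to s2:

ssh s2 which orted
ssh s2 'echo $PATH'
ssh s2 'echo $LD_LIBRARY_PATH'

If the orted path or library path seen here differs from what an interactive login on s2 reports, the remote daemon may come from a different build than the mpirun on s1.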

@ggouaillardet

Yes, both hosts are running the same hello program from the same shared directory over Ethernet.

Here is the output:

mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -host s1,s2 --tag-output md5sum ./hello
[s2:25690] [[63978,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[63978,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)

This is something that should be reported to the developers.
--------------------------------------------------------------------------
[1,0]<stdout>:898a451fe3b1993698530c899dcde5ad  ./hello

RahulKulhari avatar Sep 26 '19 12:09 RahulKulhari

What if you run

`which mpirun` --host s1,s2 true

ggouaillardet avatar Sep 26 '19 14:09 ggouaillardet

FWIW: this error almost always means that the OMPI version on the remote node is different from that on the node where mpirun is executing.

rhc54 avatar Sep 26 '19 14:09 rhc54
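
A quick way to confirm or rule out such a mismatch is to compare the repo revision that each host reports from a non-interactive shell. A minimal sketch, assuming ompi_info is on the default PATH of both hosts:

for h in s1 s2; do echo "== $h =="; ssh $h "which mpirun; ompi_info | grep 'repo revision'"; done

If the two hosts print different paths or revisions, that mismatch is the most likely cause of the unpack error.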

I used these instructions for the install: https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX. I am getting a similar error; it works okay for -np 16:

rocm@prj47-rack-39:~/hiplammps20191206/lammps/examples/melt$ /home/rocm/ompi/ompiinstall/bin/mpirun  -np 16 --host localhost,remotehost --oversubscribe -mca pml ucx  -x UCX_IB_ADDR_TYPE=ib_global  /home/rocm/hiplammps20191206/lammps/src/lmp_hip -in /home/rocm/hiplammps20191206/lammps/examples/melt/in.4m.melt
LAMMPS (19 Jul 2019)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (167.96 167.96 167.96)
  2 by 2 by 4 MPI processor grid
Created 4000000 atoms

A similar error occurs after increasing to -np 32:

rocm@prj47-rack-39:~/hiplammps20191206/lammps/examples/melt$ /home/rocm/ompi/ompiinstall/bin/mpirun  -np 32 --host localhost,remotehost --oversubscribe -mca pml ucx  -x UCX_IB_ADDR_TYPE=ib_global  /home/rocm/hiplammps20191206/lammps/src/lmp_hip -in /home/rocm/hiplammps20191206/lammps/examples/melt/in.4m.melt
[prj47-rack-69:63575] [[3722,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/direct/grpcomm_direct.c at line 351
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[3722,0],1] FORCE-TERMINATE AT ../../../../../orte/mca/grpcomm/direct/grpcomm_direct.c:355 - error Data unpack would read past end of buffer(-26)

This is something that should be reported to the developers.

A simple `which mpirun` with -np 32 works:

rocm@prj47-rack-39:~/hiplammps20191206/lammps/examples/melt$ /home/rocm/ompi/ompiinstall/bin/mpirun  -np 32 --host localhost,remotehost --oversubscribe -mca pml ucx  -x UCX_IB_ADDR_TYPE=ib_global  which mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
rocm@prj47-rack-39:~/hiplammps20191206/lammps/examples/melt$

There is a slight difference between the two MPI builds (dev revision 7298 versus 7296); a sketch for bringing the two installs back in sync follows this comment:

rocm@prj47-rack-39:~/hiplammps20191206/lammps/examples/melt$ ompi_info
                 Package: Open MPI rocm@prj47-rack-39 Distribution
                Open MPI: 4.1.0a1
  Open MPI repo revision: v2.x-dev-7298-gcdf46e6
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.1.0a1
  Open RTE repo revision: v2.x-dev-7298-gcdf46e6
   Open RTE release date: Unreleased developer copy
                    OPAL: 4.1.0a1
      OPAL repo revision: v2.x-dev-7298-gcdf46e6
       OPAL release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 4.1.0a1
                  Prefix: /home/rocm/ompi/ompiinstall
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: prj47-rack-39
           Configured by: rocm
           Configured on: Fri Dec  6 11:39:53 PST 2019
          Configure host: prj47-rack-39


rocm@prj47-rack-69:~$
rocm@prj47-rack-69:~$ ompi_info
                 Package: Open MPI rocm@prj47-rack-69 Distribution
                Open MPI: 4.1.0a1
  Open MPI repo revision: v2.x-dev-7296-g37f5079
   Open MPI release date: Unreleased developer copy
                Open RTE: 4.1.0a1
  Open RTE repo revision: v2.x-dev-7296-g37f5079
   Open RTE release date: Unreleased developer copy
                    OPAL: 4.1.0a1
      OPAL repo revision: v2.x-dev-7296-g37f5079
       OPAL release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 4.1.0a1
                  Prefix: /home/rocm/ompi/ompiinstall
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: prj47-rack-69
           Configured by: rocm
           Configured on: Fri Dec  6 10:03:38 PST 2019
          Configure host: prj47-rack-69

Strace of the orted error: http://davidjyoung.com/ompi/orted.error.strace.txt. Thanks.

djygithub avatar Dec 10 '19 16:12 djygithub
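
Given that the two nodes report different development revisions (7298 versus 7296), the most likely fix is to put the exact same Open MPI build on both nodes. A minimal sketch, following the UCX wiki linked above; the UCX install path /home/rocm/ucx/install is illustrative, while the prefix matches the one shown in ompi_info:

./configure --prefix=/home/rocm/ompi/ompiinstall --with-ucx=/home/rocm/ucx/install
make -j 8 all install

Either build from the same source checkout on both nodes, or build once and copy the install tree to the other node, e.g.:

rsync -a /home/rocm/ompi/ompiinstall/ prj47-rack-69:/home/rocm/ompi/ompiinstall/

Afterwards, ompi_info on both nodes should report the same repo revision.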

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

github-actions[bot] avatar Feb 16 '24 17:02 github-actions[bot]

Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.

I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!

github-actions[bot] avatar Mar 01 '24 17:03 github-actions[bot]