ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c
Open MPI Version: v4.0.0
Output of ompi_info | head on the two machines:
mpiuser@s2:~$ ssh s1 ompi_info | head
Package: Open MPI mpiuser@s1 Distribution
Open MPI: 4.0.0
Open MPI repo revision: v4.0.0
Open MPI release date: Nov 12, 2018
Open RTE: 4.0.0
Open RTE repo revision: v4.0.0
Open RTE release date: Nov 12, 2018
OPAL: 4.0.0
OPAL repo revision: v4.0.0
OPAL release date: Nov 12, 2018
mpiuser@s2:~$ ompi_info | head
Package: Open MPI mpiuser@s2 Distribution
Open MPI: 4.0.0
Open MPI repo revision: v4.0.0
Open MPI release date: Nov 12, 2018
Open RTE: 4.0.0
Open RTE repo revision: v4.0.0
Open RTE release date: Nov 12, 2018
OPAL: 4.0.0
OPAL repo revision: v4.0.0
OPAL release date: Nov 12, 2018
Both are installed using a common shared network filesystem.
While running the command on s1 (master):
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -n 2 ./hello
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s1 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 112)
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s1 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 112)
While running the command separately on s2 (slave):
mpiuser@s2:~/cloud$ mpirun -n 2 ./hello
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s2 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 113)
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s2 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 113)
Output of the hwloc package check on s2:
mpiuser@s2:~/cloud/openmpi-4.0.0$ dpkg -l | grep hwloc
mpiuser@s2:~/cloud/openmpi-4.0.0$
Output of the hwloc package check on s1:
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ dpkg -l | grep hwloc
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$
Both machines are running Ubuntu 16.04.5 LTS.
But while running the command distributed across both hosts, it gives the following error:
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -host s1,s2 ./hello
[s2:26283] [[40517,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
--------------------------------------------------------------------------
An internal error has occurred in ORTE:
[[40517,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)
This is something that should be reported to the developers.
--------------------------------------------------------------------------
Are you sure you are running the very same hello program on both hosts?
You can double check this by running
mpirun -host s1,s2 --tag-output md5sum ./hello
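If the md5sums do match, it may also help to compare the Open MPI install that each host actually resolves; a minimal check (assuming password-less ssh and that PATH is set for non-interactive shells, as in the outputs above) would be something like:
ssh s1 which mpirun orted
ssh s2 which mpirun orted
ssh s1 mpirun --version
ssh s2 mpirun --version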
@ggouaillardet
Yes, both are running the same hello program from the same shared directory over Ethernet.
Please find the output:
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -host s1,s2 --tag-output md5sum ./hello
[s2:25690] [[63978,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
--------------------------------------------------------------------------
An internal error has occurred in ORTE:
[[63978,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)
This is something that should be reported to the developers.
--------------------------------------------------------------------------
[1,0]<stdout>:898a451fe3b1993698530c899dcde5ad ./hello
What if you run
`which mpirun` --host s1,s2 true
FWIW: this error almost always means that the OMPI version on the remote node is different from that on the node where mpirun is executing.
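One way to rule that out is to check what a non-interactive shell on the remote node resolves, and to launch with an explicit --prefix so the remote orted comes from the same install tree. A rough sketch (here /path/to/ompi-install is a placeholder, since the actual install prefix is not shown in this thread):
ssh s2 'which mpirun orted && mpirun --version'
/path/to/ompi-install/bin/mpirun --prefix /path/to/ompi-install -host s1,s2 ./hello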
I used these instructions for installation: https://github.com/openucx/ucx/wiki/OpenMPI-and-OpenSHMEM-installation-with-UCX
I am getting a similar error; it works okay for -np 16:
rocm@prj47-rack-39:~/hiplammps20191206/lammps/examples/melt$ /home/rocm/ompi/ompiinstall/bin/mpirun -np 16 --host localhost,remotehost --oversubscribe -mca pml ucx -x UCX_IB_ADDR_TYPE=ib_global /home/rocm/hiplammps20191206/lammps/src/lmp_hip -in /home/rocm/hiplammps20191206/lammps/examples/melt/in.4m.melt
LAMMPS (19 Jul 2019)
Lattice spacing in x,y,z = 1.6796 1.6796 1.6796
Created orthogonal box = (0 0 0) to (167.96 167.96 167.96)
2 by 2 by 4 MPI processor grid
Created 4000000 atoms
A similar error occurs after increasing to -np 32:
rocm@prj47-rack-39:~/hiplammps20191206/lammps/examples/melt$ /home/rocm/ompi/ompiinstall/bin/mpirun -np 32 --host localhost,remotehost --oversubscribe -mca pml ucx -x UCX_IB_ADDR_TYPE=ib_global /home/rocm/hiplammps20191206/lammps/src/lmp_hip -in /home/rocm/hiplammps20191206/lammps/examples/melt/in.4m.melt
[prj47-rack-69:63575] [[3722,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/direct/grpcomm_direct.c at line 351
--------------------------------------------------------------------------
An internal error has occurred in ORTE:
[[3722,0],1] FORCE-TERMINATE AT ../../../../../orte/mca/grpcomm/direct/grpcomm_direct.c:355 - error Data unpack would read past end of buffer(-26)
This is something that should be reported to the developers.
A simple `which mpirun` with -np 32 works:
rocm@prj47-rack-39:~/hiplammps20191206/lammps/examples/melt$ /home/rocm/ompi/ompiinstall/bin/mpirun -np 32 --host localhost,remotehost --oversubscribe -mca pml ucx -x UCX_IB_ADDR_TYPE=ib_global which mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
/home/rocm/ompi/ompiinstall/bin/mpirun
rocm@prj47-rack-39:~/hiplammps20191206/lammps/examples/melt$
There is a slight difference in the Open MPI builds on the two nodes (7298 versus 7296):
rocm@prj47-rack-39:~/hiplammps20191206/lammps/examples/melt$ ompi_info
Package: Open MPI rocm@prj47-rack-39 Distribution
Open MPI: 4.1.0a1
Open MPI repo revision: v2.x-dev-7298-gcdf46e6
Open MPI release date: Unreleased developer copy
Open RTE: 4.1.0a1
Open RTE repo revision: v2.x-dev-7298-gcdf46e6
Open RTE release date: Unreleased developer copy
OPAL: 4.1.0a1
OPAL repo revision: v2.x-dev-7298-gcdf46e6
OPAL release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 4.1.0a1
Prefix: /home/rocm/ompi/ompiinstall
Configured architecture: x86_64-pc-linux-gnu
Configure host: prj47-rack-39
Configured by: rocm
Configured on: Fri Dec 6 11:39:53 PST 2019
Configure host: prj47-rack-39
rocm@prj47-rack-69:~$
rocm@prj47-rack-69:~$ ompi_info
Package: Open MPI rocm@prj47-rack-69 Distribution
Open MPI: 4.1.0a1
Open MPI repo revision: v2.x-dev-7296-g37f5079
Open MPI release date: Unreleased developer copy
Open RTE: 4.1.0a1
Open RTE repo revision: v2.x-dev-7296-g37f5079
Open RTE release date: Unreleased developer copy
OPAL: 4.1.0a1
OPAL repo revision: v2.x-dev-7296-g37f5079
OPAL release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 4.1.0a1
Prefix: /home/rocm/ompi/ompiinstall
Configured architecture: x86_64-pc-linux-gnu
Configure host: prj47-rack-69
Configured by: rocm
Configured on: Fri Dec 6 10:03:38 PST 2019
Configure host: prj47-rack-69
strace of the orted error: http://davidjyoung.com/ompi/orted.error.strace.txt Thanks.
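Given that the two repo revisions differ (v2.x-dev-7298 versus v2.x-dev-7296), one possible fix is to rebuild both nodes from the same checkout, or simply copy the install tree from one node to the other and re-check the reported revision. A rough sketch using the paths shown above (the rsync step is an assumption, not something verified in this thread):
rsync -a /home/rocm/ompi/ompiinstall/ rocm@prj47-rack-69:/home/rocm/ompi/ompiinstall/
ssh prj47-rack-69 '/home/rocm/ompi/ompiinstall/bin/ompi_info | grep "repo revision"'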
It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.
Per the above comment, it has been a month with no reply on this issue. It looks like this issue has been abandoned.
I'm going to close this issue. If I'm wrong and this issue is not abandoned, please feel free to re-open it. Thank you!