Stuck on third send/recv when connecting two independent MPI applications with MPI_Comm_connect and MPI_Comm_accept via prte
Background information
Open MPI v5.0.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.3.tar.gz
tar zxf openmpi-5.0.3.tar.gz
ln -s openmpi-5.0.3 openmpi
cd openmpi
./configure --prefix=/home/lab/bin/openmpi
make -j $(nproc) all
make install
Please describe the system on which you are running
- Operating system/version: Ubuntu 20.04.5 LTS
- Computer hardware: irrelevant
- Network type: irrelevant; everything runs on a single local host
Details of the problem
I am trying to connect a client and a server with MPI_Comm_connect and MPI_Comm_accept. The connection is established fine, but both programs get stuck on the third message. I followed the steps in the documentation: https://docs.open-mpi.org/en/v5.0.x/launching-apps/unusual.html#connecting-independent-mpi-applications
The steps:
Terminal 1:
Command:
prte --report-uri uri.txt
Output:
DVM ready
Terminal 2:
Command:
mpicc -o server server.c
mpiexec -np 1 --dvm file:uri.txt ./server
Output:
225771546.0:3674309388
Error ret 0: MPI_SUCCESS: no errors
Error status 0: MPI_SUCCESS: no errors
Recv: 1
Error ret 0: MPI_SUCCESS: no errors
Error status 0: MPI_SUCCESS: no errors
Recv: 2
Terminal 3:
Command:
mpicc -o client client.c
mpiexec -np 1 --dvm file:uri.txt ./client 225771546.0:3674309388
Output:
Error ret 0: MPI_SUCCESS: no errors
Send: 1
Error ret 0: MPI_SUCCESS: no errors
Send: 2
Code used:
server.c:
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_Comm client;
    MPI_Status status;
    char port_name[MPI_MAX_PORT_NAME];
    int rank, nprocs, error, eclass, len, ret;
    char estring[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Open_port(MPI_INFO_NULL, port_name);
    printf("%s\n", port_name);
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);

    int msg1;
    int msg2;
    int msg3;

    ret = MPI_Recv(&msg1, 1, MPI_INT, MPI_ANY_SOURCE, 0, client, &status);
    MPI_Error_class(ret, &eclass);
    MPI_Error_string(ret, estring, &len);
    printf("Error ret %d: %s\n", eclass, estring);
    MPI_Error_class(status.MPI_ERROR, &eclass);
    MPI_Error_string(status.MPI_ERROR, estring, &len);
    printf("Error status %d: %s\n", eclass, estring);
    printf("Recv: %d\n", msg1);
    fflush(stdout);

    ret = MPI_Recv(&msg2, 1, MPI_INT, MPI_ANY_SOURCE, 0, client, &status);
    MPI_Error_class(ret, &eclass);
    MPI_Error_string(ret, estring, &len);
    printf("Error ret %d: %s\n", eclass, estring);
    MPI_Error_class(status.MPI_ERROR, &eclass);
    MPI_Error_string(status.MPI_ERROR, estring, &len);
    printf("Error status %d: %s\n", eclass, estring);
    printf("Recv: %d\n", msg2);
    fflush(stdout);

    ret = MPI_Recv(&msg3, 1, MPI_INT, MPI_ANY_SOURCE, 0, client, &status);
    MPI_Error_class(ret, &eclass);
    MPI_Error_string(ret, estring, &len);
    printf("Error ret %d: %s\n", eclass, estring);
    MPI_Error_class(status.MPI_ERROR, &eclass);
    MPI_Error_string(status.MPI_ERROR, estring, &len);
    printf("Error status %d: %s\n", eclass, estring);
    printf("Recv: %d\n", msg3);
    fflush(stdout);

    MPI_Comm_disconnect(&client);
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}
client.c:
#include "mpi.h"
#include <string.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm server;
    char port_name[MPI_MAX_PORT_NAME];
    int ret, eclass, len;
    char estring[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    strcpy(port_name, argv[1]); /* assume server's port name is the cmd-line arg */

    int msg1 = 1;
    int msg2 = 2;
    int msg3 = 3;

    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);

    ret = MPI_Send(&msg1, 1, MPI_INT, 0, 0, server);
    MPI_Error_class(ret, &eclass);
    MPI_Error_string(ret, estring, &len);
    printf("Error ret %d: %s\n", eclass, estring);
    printf("Send: %d\n", msg1);
    fflush(stdout);

    ret = MPI_Send(&msg2, 1, MPI_INT, 0, 0, server);
    MPI_Error_class(ret, &eclass);
    MPI_Error_string(ret, estring, &len);
    printf("Error ret %d: %s\n", eclass, estring);
    printf("Send: %d\n", msg2);
    fflush(stdout);

    ret = MPI_Send(&msg3, 1, MPI_INT, 0, 0, server);
    MPI_Error_class(ret, &eclass);
    MPI_Error_string(ret, estring, &len);
    printf("Error ret %d: %s\n", eclass, estring);
    printf("Send: %d\n", msg3);
    fflush(stdout);

    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}
Your code works just fine on my setup. My main issue arises from the fact that the prte DVM decides to only use the 'lo' interface, which forces me to start all processes on the same host.
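In case it helps with the interface issue: one commonly suggested approach (worth verifying against the docs for your Open MPI/PRRTE versions) is to restrict which TCP interfaces the byte-transfer layer uses via an MCA parameter. The interface name below is a placeholder, not taken from this report:

```shell
# Sketch: tell Open MPI's TCP BTL to use a specific interface instead of
# loopback. "eth0" is a placeholder -- substitute your real interface
# (see `ip addr`). The runtime's own (prte-level) traffic may need a
# separate PRRTE MCA setting; check the PRRTE documentation.
mpiexec --dvm file:uri.txt \
        --mca btl_tcp_if_include eth0 \
        -np 1 ./server
```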
Can I get more feedback, such as how do I debug the app to find out why it's stuck?
As a first step, you should attach a debugger to the running processes (gdb -p) and look at the stack traces to see exactly where the processes are blocked.
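For completeness, a minimal recipe for that (the process names match the binaries above; the PID is whatever pgrep reports):

```shell
# Find the PIDs of the stuck server and client processes.
pgrep -af 'server|client'

# Attach non-interactively and dump backtraces for all threads --
# with Open MPI, progress may be happening on a helper thread,
# so a single-thread backtrace can be misleading.
gdb -p <PID> -batch -ex 'thread apply all bt'
```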
As I said before, the program gets stuck in the third Send/Recv; you can see it in the gdb backtraces below, obtained by attaching a debugger as you suggested. gdb on the server:
gdb -p 593951
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 593951
[New LWP 593952]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fd8eed3546e in epoll_wait (epfd=3, events=0x564a1bcc7810, maxevents=32, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
(gdb) bt
#0 0x00007fd8eed3546e in epoll_wait (epfd=3, events=0x564a1bcc7810, maxevents=32, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007fd8ee988469 in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#2 0x00007fd8ee97e4a5 in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#3 0x00007fd8eeb3bb53 in opal_progress_events () from /beegfs/home/dariomnz/bin/ompi5/lib/libopen-pal.so.80
#4 0x00007fd8eeb3bc25 in opal_progress () from /beegfs/home/dariomnz/bin/ompi5/lib/libopen-pal.so.80
#5 0x00007fd8ef06ff60 in mca_pml_ob1_recv () from /beegfs/home/dariomnz/bin/ompi5/lib/libmpi.so.40
#6 0x00007fd8eeeea9ef in PMPI_Recv () from /beegfs/home/dariomnz/bin/ompi5/lib/libmpi.so.40
#7 0x000056433f13e5fd in main (argc=1, argv=0x7ffefe82ca88) at server.c:45
(gdb)
Client:
gdb -p 593961
Attaching to process 593961
[New LWP 593962]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007faae290f46e in epoll_wait (epfd=3, events=0x555fdbb5b810, maxevents=32, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30 ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
(gdb) bt
#0 0x00007faae290f46e in epoll_wait (epfd=3, events=0x555fdbb5b810, maxevents=32, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007faae2562469 in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#2 0x00007faae25584a5 in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#3 0x00007faae2715b53 in opal_progress_events () from /beegfs/home/dariomnz/bin/ompi5/lib/libopen-pal.so.80
#4 0x00007faae2715c25 in opal_progress () from /beegfs/home/dariomnz/bin/ompi5/lib/libopen-pal.so.80
#5 0x00007faae2c4e313 in mca_pml_ob1_send () from /beegfs/home/dariomnz/bin/ompi5/lib/libmpi.so.40
#6 0x00007faae2aca1f3 in PMPI_Send () from /beegfs/home/dariomnz/bin/ompi5/lib/libmpi.so.40
#7 0x0000562e06ef4500 in main (argc=2, argv=0x7ffdf534e6a8) at client.c:35
(gdb)
Can I get more information on how to continue debugging so I can make this work?
After much trial and error, the problem was that I needed to build Open MPI with Slurm support (--with-slurm=/opt/slurm); without it, this behavior of getting stuck on the third send occurs.
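For reference, the rebuild that fixed it combines the original configure line with the Slurm flag (paths are from my setup; adjust the prefix and the Slurm install location to yours):

```shell
# Reconfigure and rebuild Open MPI with Slurm support.
cd openmpi
./configure --prefix=/home/lab/bin/openmpi --with-slurm=/opt/slurm
make -j "$(nproc)" all
make install
```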