Return error from node failure
Background information
What version of Open MPI are you using?
v5.0.0rc7
Describe how Open MPI was installed
tarball
Please describe the system on which you are running
- Operating system/version: Linux 4.19.0-18-cloud-amd64 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux
- Network type: TCP/IP
Details of the problem
I am trying to make a distributed system built on Open MPI continue past a node failure. To do this I must detect and handle the node failure in my code.
I am using Open MPI v5.0.0rc7, run with "--with-ft ulfm", and have set MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN). However, the node failure does not seem to be returned as an error that can be handled in the code.
Example:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int comm_size;
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int window_buffer = 0;
    if (my_rank == 1)
    {
        window_buffer = 12345;
    }

    MPI_Win window;
    MPI_Win_create(&window_buffer, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &window);
    MPI_Win_fence(0, window);

    int value_fetched;
    if (my_rank == 0)
    {
        // Network fails. Attempt to fetch the value from the MPI process 1 window
        system("sudo iptables -A OUTPUT -d 10.166.0.18 -j DROP");
        system("sudo iptables -A INPUT -s 10.166.0.18 -j DROP");
        int err = MPI_Get(&value_fetched, 1, MPI_INT, 1, 0, 1, MPI_INT, window);

        // Handle error
        if (err)
        {
            printf("Received error from MPI_Get: %d\n", err);
        }

        // reset firewall
        system("sudo iptables --flush");
    }

    MPI_Win_fence(0, window);
    MPI_Win_free(&window);
    MPI_Finalize();

    return EXIT_SUCCESS;
}
$ /home/ompi5rc7/bin/mpic++ example.cpp
$ /home/ompi5rc7/bin/mpirun --with-ft ulfm -n 2 --hostfile ../hosts ./a.out
--------------------------------------------------------------------------
WARNING: The selected 'osc' module 'rdma' is not tested for post-failure
operation, yet you have requested support for fault tolerance.
When using this component, normal failure free operation is expected;
However, failures may cause the application to abort, crash or deadlock.
In this framework, the following components are tested to operate under
failure scenarios: {}
--------------------------------------------------------------------------
1 more process has sent help message help-mpi-ft.txt / module:untested:failundef
1 more process has sent help message help-mpi-ft.txt / module:untested:failundef
< long wait here >
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
node-died
But I couldn't open the help file:
(null). Sorry!
--------------------------------------------------------------------------
I have also tried running with "/home/ompi5rc7/bin/mpirun --with-ft ulfm --mca btl tcp,self -n 2 --hostfile ../hosts ./a.out" but get the same output. I am not using RDMA.
Is it possible to print out the error code after a node failure?
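(For reference, once MPI_Get actually returns a nonzero code, I would expect to be able to turn it into a readable message with the standard MPI_Error_string call; this is only a sketch of the handling I have in mind, extending the error branch of the example above:)

if (err != MPI_SUCCESS)
{
    char msg[MPI_MAX_ERROR_STRING];
    int msg_len = 0;
    // Translate the numeric error code into a human-readable message
    MPI_Error_string(err, msg, &msg_len);
    printf("Received error from MPI_Get: %d (%s)\n", err, msg);
}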
As the large banner on your output indicates, using ULFM with RMA windows is very experimental at this point.
We have had success in the past running some code using the osc_pt2pt component, but that component has since been removed. Your output shows you are using the osc_rdma component, which has not been tested and may very well deadlock when faults happen in a variety of situations.
You may still have luck modifying your test program in the following ways (a sketch follows this list):
- You need to set an error handler on the window to capture errors during GET/PUT/FENCE calls: use MPI_WIN_SET_ERRHANDLER. By default the error handler on windows is MPI_ERRORS_ARE_FATAL, and it is not modified when you change the error handler on MPI_COMM_WORLD.
- Capture the return code from the MPI_WIN_FENCE operations as well.
- The way you inject failures (essentially making the node unreachable on the network) means that you may detect the failure only after TCP timeouts, which can take up to 1 hour. Instead, inject the failure by calling raise(SIGKILL) from the code of one of your ranks, rather than blocking TCP traffic: the infrastructure will detect the failed process immediately and propagate the fault much more quickly than TCP timeouts.
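A minimal sketch of the test with those three changes applied (illustration only; the raise(SIGKILL) on rank 1 replaces the iptables calls, and with an untested osc component the closing fence may still deadlock rather than return an error):

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int window_buffer = (my_rank == 1) ? 12345 : 0;

    MPI_Win window;
    MPI_Win_create(&window_buffer, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &window);

    // RMA errors are reported through the window's error handler,
    // not the one installed on MPI_COMM_WORLD.
    MPI_Win_set_errhandler(window, MPI_ERRORS_RETURN);

    int err = MPI_Win_fence(0, window);

    if (my_rank == 1)
    {
        // Kill this rank outright: the runtime notices the dead process
        // immediately instead of waiting for TCP timeouts.
        raise(SIGKILL);
    }

    if (my_rank == 0)
    {
        int value_fetched = 0;
        err = MPI_Get(&value_fetched, 1, MPI_INT, 1, 0, 1, MPI_INT, window);
        if (err != MPI_SUCCESS)
            printf("MPI_Get returned error %d\n", err);
    }

    // Check the closing fence as well; the failure may only surface here.
    err = MPI_Win_fence(0, window);
    if (err != MPI_SUCCESS)
        printf("MPI_Win_fence returned error %d\n", err);

    MPI_Win_free(&window);
    MPI_Finalize();
    return EXIT_SUCCESS;
}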
@abouteiller Can you look into the "Sorry! You were supposed to get help..." issue? It seems like there was supposed to be a real help message there.
The 'node-died' issue appears to be related to prted not finding its own file share/prte/help-errmgr-base.txt.
Are you using an internal prte? (It is indicated in the final lines of the 'configure' output.)
I am using an internal prte. I tried adding the path to the Open MPI installation to PATH, but that did not fix the missing error text. It's not a problem though.
Calling MPI_Win_set_errhandler(window, MPI_ERRORS_RETURN); after creating the window does not fix the problem of the node crash not being returned as an error.
It seems the program crashes once MPI_Get attempts to access a node that is inaccessible (the node-died error) or whose process has raised SIGKILL (in which case it fails silently).
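What I was hoping to be able to do on rank 0 is something along these lines (just a sketch; it assumes the ULFM build exposes MPIX_ERR_PROC_FAILED through mpi-ext.h, and report_rma_error is a hypothetical helper, not part of my actual program):

#include <stdio.h>
#include <mpi.h>
#include <mpi-ext.h> /* ULFM extensions such as MPIX_ERR_PROC_FAILED */

/* Hypothetical helper: report whether an RMA call failed because a peer process died. */
void report_rma_error(int err, const char* what)
{
    if (err == MPI_SUCCESS)
        return;
    int err_class;
    MPI_Error_class(err, &err_class);
    if (err_class == MPIX_ERR_PROC_FAILED)
        printf("%s: a peer process has failed (error %d, class %d)\n", what, err, err_class);
    else
        printf("%s: error %d (class %d)\n", what, err, err_class);
}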
Is it safe to say that I should wait for the full Open MPI 5 release?
Thank you for your time.
Side note: I do not have RDMA installed, and when I run the program without node failures it runs correctly, so perhaps RDMA is being reported in error?
The root cause for the missing help message is here https://github.com/openpmix/prrte/issues/1360
FWIW: the missing help message problem has been fixed in PRRTE (both master and release branches).