ompi Return error from node failure

Background information

What version of Open MPI are you using?

v5.0.0rc7

Describe how Open MPI was installed

tarball

Please describe the system on which you are running

Operating system/version: Linux 4.19.0-18-cloud-amd64 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux
Network type: TCP/IP

Details of the problem

I am trying to make a distributed system built on Open MPI continue running after a node failure. To do this, the application must be able to detect the failure and handle it.

I am using Open MPI v5.0.0rc7, launched with "--with-ft ulfm", and have called "MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN)". The node failure does not seem to be returned as an error code that the program can handle.

Example:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
 
int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int comm_size;
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
 
    int window_buffer = 0;
    if (my_rank == 1)
    {
        window_buffer = 12345;
    } 

    MPI_Win window;
    MPI_Win_create(&window_buffer, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &window);
    MPI_Win_fence(0, window);
 
    int value_fetched;
    if(my_rank == 0)
    {
        // Network fails. Attempt to fetch the value from the MPI process 1 window
        system("sudo iptables -A OUTPUT -d 10.166.0.18 -j  DROP");
        system("sudo iptables -A INPUT -s 10.166.0.18 -j DROP");
        int err = MPI_Get(&value_fetched, 1, MPI_INT, 1, 0, 1, MPI_INT, window);

        // Handle error
        if (err)
        {
            printf("Received error from MPI_Get: %d\n", err);
        }
        // reset firewall
        system("sudo iptables --flush");
    }
 
    MPI_Win_fence(0, window);
    MPI_Win_free(&window); 
    MPI_Finalize();
    return EXIT_SUCCESS;
}

$ /home/ompi5rc7/bin/mpic++ example.cpp
$ /home/ompi5rc7/bin/mpirun --with-ft ulfm -n 2 --hostfile ../hosts ./a.out
--------------------------------------------------------------------------
WARNING: The selected 'osc' module 'rdma' is not tested for post-failure
operation, yet you have requested support for fault tolerance.
When using this component, normal failure free operation is expected;
However, failures may cause the application to abort, crash or deadlock.

In this framework, the following components are tested to operate under
failure scenarios: {}
--------------------------------------------------------------------------
1 more process has sent help message help-mpi-ft.txt / module:untested:failundef
1 more process has sent help message help-mpi-ft.txt / module:untested:failundef 

< long wait here >

--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    node-died
But I couldn't open the help file:
    (null).  Sorry!
--------------------------------------------------------------------------

I have also tried running with "/home/ompi5rc7/bin/mpirun --with-ft ulfm --mca btl tcp,self -n 2 --hostfile ../hosts ./a.out" but get the same output. I am not using RDMA.

Is it possible to print out the error code after a node failure?

May 15 '22 11:05 hatmer

As the large banner on your output indicates, using ULFM with RMA windows is very experimental at this point.

We have had success in the past running some code using the osc_pt2pt component, but that component has since been removed. The output reports that you are using the osc_rdma component, which has not been tested under failures and may very well deadlock when faults happen in a variety of situations.

You may still have luck modifying your test program in the following ways:

You need to set an error handler on the window to capture errors during GET/PUT/FENCE calls: use MPI_Win_set_errhandler. By default, the error handler on windows is MPI_ERRORS_ARE_FATAL, and it is not modified when you change the error handler on MPI_COMM_WORLD.
Capture the return code from the MPI_Win_fence operations as well.
The way you inject your failures (essentially making the node unreachable over the network) means that you may detect the failure only after TCP timeouts, which can take up to 1 hour. You can instead inject failures by calling raise(SIGKILL) from the code of one of your ranks, rather than blocking TCP traffic: the infrastructure will detect the failed process immediately and propagate the fault much more quickly than TCP timeouts would.

May 23 '22 19:05 abouteiller

@abouteiller Can you look into the "Sorry! You were supposed to get help..." issue? It seems like there was supposed to be a real help message there.

May 23 '22 20:05 jsquyres

The 'node-died' issue appears to be related to prted not finding its own help file, share/prte/help-errmgr-base.txt.

Are you using an internal prte? (This is indicated in the final lines of the 'configure' output.)

May 23 '22 20:05 abouteiller

I am using an internal prte. I tried adding the path to the openmpi installation to PATH, but that did not fix the missing error text. It's not a problem though.

Calling

MPI_Win_set_errhandler(window, MPI_ERRORS_RETURN);

after creating the window does not fix the problem of the node crash not being returned as an error.

It seems the program crashes once the MPI_Get attempts to access a node that is inaccessible (node-died error) or has raised a SIGKILL (fails silently).

Is it safe to say that I should wait for the full OpenMPI 5 release?

Thank you for your time.

Side note: I do not have RDMA installed, and when I run the program without node failures it runs correctly, so perhaps RDMA is being reported in error?

May 24 '22 13:05 hatmer

The root cause of the missing help message is described here: https://github.com/openpmix/prrte/issues/1360

May 26 '22 19:05 abouteiller

FWIW: the missing help message problem has been fixed in PRRTE (both master and release branches).

Jun 06 '22 17:06 rhc54
