
Fault tolerance error when re-spawning a process with mpiexec on a remote node

Open dariomnz opened this issue 1 year ago • 0 comments

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

  • ompi_info --version reports: Open MPI v5.0.3


Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.3.tar.gz
tar zxf openmpi-5.0.3.tar.gz
ln   -s openmpi-5.0.3  openmpi

# 4) Install openmpi (from source code)
mkdir -p /home/lab/bin
cd       ${DESTINATION_PATH}/openmpi
./configure --prefix=/home/lab/bin/openmpi
make -j $(nproc) all
make install
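
As a quick sanity check (not part of the original build steps), the fault-tolerance/ULFM support of this build can be confirmed with ompi_info; the relevant lines also appear in the full output below:

# assumes the standard bin/ layout under the configured prefix
/home/lab/bin/openmpi/bin/ompi_info | grep -E "Fault Tolerance|FT MPI|MPI extensions"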
Output of ompi_info:
+ ompi_info
                 Package: Open MPI root@buildkitsandbox Distribution
                Open MPI: 5.0.3
  Open MPI repo revision: v5.0.3
   Open MPI release date: Apr 08, 2024
                 MPI API: 3.1.0
            Ident string: 5.0.3
                  Prefix: /home/lab/bin/openmpi
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: root
           Configured on: Fri May 31 08:42:58 UTC 2024
          Configure host: buildkitsandbox
  Configure command line: '--prefix=/home/lab/bin/openmpi'
                Built by: 
                Built on: Fri May 31 08:51:40 UTC 2024
              Built host: buildkitsandbox
              C bindings: yes
             Fort mpif.h: no
            Fort use mpi: no
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: no
 Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /bin/gcc
  C compiler family name: GNU
      C compiler version: 11.4.0
            C++ compiler: g++
   C++ compiler absolute: /bin/g++
           Fort compiler: none
       Fort compiler abs: none
         Fort ignore TKR: no
   Fort 08 assumed shape: no
      Fort optional args: no
          Fort INTERFACE: no
    Fort ISO_FORTRAN_ENV: no
       Fort STORAGE_SIZE: no
      Fort BIND(C) (all): no
      Fort ISO_C_BINDING: no
 Fort SUBROUTINE BIND(C): no
       Fort TYPE,BIND(C): no
 Fort T,BIND(C,name="a"): no
            Fort PRIVATE: no
           Fort ABSTRACT: no
       Fort ASYNCHRONOUS: no
          Fort PROCEDURE: no
         Fort USE...ONLY: no
           Fort C_FUNLOC: no
 Fort f08 using wrappers: no
         Fort MPI_SIZEOF: no
             C profiling: yes
   Fort mpif.h profiling: no
  Fort use mpi profiling: no
   Fort use mpi_f08 prof: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
          MPI extensions: affinity, cuda, ftmpi, rocm
 Fault Tolerance support: yes
          FT MPI support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
         MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.0.3)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.0.3)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.0.3)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.0.3)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.0.3)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.0.3)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.0.3)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.0.3)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.0.3)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.0.3)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v5.0.3)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.0.3)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                MCA smsc: cma (MCA v2.1.0, API v1.0.0, Component v5.0.3)
             MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.0.3)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                 MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.0.3)
                MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA coll: monitoring (MCA v2.1.0, API v2.4.0, Component
                          v5.0.3)
                MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.0.3)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.0.3)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
                          v5.0.3)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                  MCA io: romio341 (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v5.0.3)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.0.3)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v5.0.3)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.0.3)
                MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.0.3)
                 MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.0.3)
                 MCA pml: monitoring (MCA v2.1.0, API v2.1.0, Component
                          v5.0.3)
                 MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.0.3)
                 MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.0.3)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.0.3)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.0.3)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v5.0.3)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v5.0.3)

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04.4 LTS (Docker container)
  • Computer hardware: irrelevant
  • Network type: irrelevant

Details of the problem

I have a master that spawns a worker; when the worker dies (the failure is simulated with a SIGKILL), the master spawns a new one, so the application is fault tolerant. Running locally on one node this works perfectly, but when it is executed remotely on another node it misbehaves and does not run correctly: just adding --host node2 makes it fail. The spawn itself is still local, i.e. the child runs on the same node as the master.
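
For quick reference, these are the two invocations compared below; the only difference is the --host option (the hostnames here are Docker container IDs):

# works: everything runs on the local node
mpiexec -n 1 --with-ft ulfm --map-by node:OVERSUBSCRIBE ./test
# fails after the first respawn: the same command, but the job is placed on a remote node
mpiexec -n 1 --host c1de8f727368 --with-ft ulfm --map-by node:OVERSUBSCRIBE ./test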

Code:

#include "mpi.h"
#include "mpi-ext.h"
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <limits.h>
#include <signal.h>


int main(int argc, char *argv[])
{
    MPI_Comm parentcomm, intercomm;
    int cast_buf, rank, ret, eclass, len;
    int errcodes[1];
    char estring[MPI_MAX_ERROR_STRING];

    char serv_name[HOST_NAME_MAX];
    gethostname(serv_name, HOST_NAME_MAX);

    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_get_parent(&parentcomm);
    if (parentcomm == MPI_COMM_NULL)
    {
        /* Parent: spawn a worker and re-spawn it every time it fails. */
        do
        {
            ret = MPI_Comm_spawn("/work/xpn/test/integrity/mpi_connect_accept/test", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm, errcodes);
            MPI_Comm_set_errhandler(intercomm, MPI_ERRORS_RETURN);

            MPI_Error_class(ret, &eclass);
            MPI_Error_string(ret, estring, &len);
            printf("MPI_Comm_spawn ret %d: %s\n", eclass, estring);
            MPI_Error_class(errcodes[0], &eclass);
            MPI_Error_string(errcodes[0], estring, &len);
            printf("MPI_Comm_spawn errcodes[0] %d: %s\n", eclass, estring);

            printf("I'm the parent. %d %d %s\n", ret, errcodes[0], serv_name);

            /* First broadcast: expected to succeed. */
            ret = MPI_Bcast(&cast_buf, 1, MPI_INT, 0, intercomm);
            MPI_Error_class(ret, &eclass);
            MPI_Error_string(ret, estring, &len);
            printf("Parent Bcast Error ret %d: %s\n", eclass, estring);

            /* Second broadcast: expected to fail with MPIX_ERR_PROC_FAILED,
               because the child kills itself before posting it. */
            ret = MPI_Bcast(&cast_buf, 1, MPI_INT, 0, intercomm);
            MPI_Error_class(ret, &eclass);
            MPI_Error_string(ret, estring, &len);
            printf("Parent Bcast Error ret %d: %s\n", eclass, estring);
            if (eclass != MPIX_ERR_PROC_FAILED)
                break;

        } while (1);
    } else {
        /* Child: note that ret and errcodes[0] are still uninitialized here. */
        printf("I'm the child. %d %d %s\n", ret, errcodes[0], serv_name);
        sleep(1);
        ret = MPI_Bcast(&cast_buf, 1, MPI_INT, MPI_ROOT, parentcomm);
        MPI_Error_class(ret, &eclass);
        MPI_Error_string(ret, estring, &len);
        printf("Child Bcast Error ret %d: %s\n", eclass, estring);
        /* Simulate a hard failure of the worker. */
        raise(SIGKILL);
        ret = MPI_Bcast(&cast_buf, 1, MPI_INT, MPI_ROOT, parentcomm);
        MPI_Error_class(ret, &eclass);
        MPI_Error_string(ret, estring, &len);
        printf("Child Bcast Error ret %d: %s\n", eclass, estring);
    }
    fflush(stdout);
    MPI_Finalize();
    return 0;
}

Hostname:

+ hostname
2e7630b38c9e

Good local execution:

+ mpicc -g -o test test.c
+ mpiexec -n 1 --with-ft ulfm --map-by node:OVERSUBSCRIBE ./test
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9e
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8596 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9e
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8598 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
I'm the child. 0 0 2e7630b38c9e
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8600 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9e
...................

Bad remote execution:

+ mpiexec -n 1 --host c1de8f727368 --with-ft ulfm --map-by node:OVERSUBSCRIBE ./test
Warning: Permanently added 'c1de8f727368' (ED25519) to the list of known hosts.
I'm the child. 0 0 c1de8f727368
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 c1de8f727368
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 3861 on node c1de8f727368 exited on
signal 9 (Killed).
--------------------------------------------------------------------------
MPI_Comm_spawn ret 14: MPI_ERR_UNKNOWN: unknown error
MPI_Comm_spawn errcodes[0] 14: MPI_ERR_UNKNOWN: unknown error
I'm the parent. 14 -50 c1de8f727368
Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator
Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator
With debug

Good local run:

+ mpiexec -n 1 --with-ft ulfm --verbose --debug-daemons --mca btl_base_verbose 100 --mca mpi_ft_verbose 100 --map-by node:OVERSUBSCRIBE ./test
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[2e7630b38c9e:08989] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08989] mca: base: components_register: found loaded component self
[2e7630b38c9e:08989] mca: base: components_register: component self register function successful
[2e7630b38c9e:08989] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08989] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08989] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08989] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08989] mca: base: components_open: opening btl components
[2e7630b38c9e:08989] mca: base: components_open: found loaded component self
[2e7630b38c9e:08989] mca: base: components_open: component self open function successful
[2e7630b38c9e:08989] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08989] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08989] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08989] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08989] [[33963,1],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08989] select: initializing btl component self
[2e7630b38c9e:08989] select: init of component self returned success
[2e7630b38c9e:08989] select: initializing btl component sm
[2e7630b38c9e:08989] select: init of component sm returned failure
[2e7630b38c9e:08989] mca: base: close: component sm closed
[2e7630b38c9e:08989] mca: base: close: unloading component sm
[2e7630b38c9e:08989] select: initializing btl component tcp
[2e7630b38c9e:08989] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08989] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08989] btl: tcp: Using interface: sppp 
[2e7630b38c9e:08989] btl:tcp: 0x5650082292e0: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08989] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08989] btl:tcp: Successfully bound to AF_INET port 1024
[2e7630b38c9e:08989] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[2e7630b38c9e:08989] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08989] select: init of component tcp returned success
[2e7630b38c9e:08989] mca: bml: Using self btl for send to [[33963,1],0] on node 2e7630b38c9e
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[2e7630b38c9e:08991] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08991] mca: base: components_register: found loaded component self
[2e7630b38c9e:08991] mca: base: components_register: component self register function successful
[2e7630b38c9e:08991] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08991] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08991] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08991] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08991] mca: base: components_open: opening btl components
[2e7630b38c9e:08991] mca: base: components_open: found loaded component self
[2e7630b38c9e:08991] mca: base: components_open: component self open function successful
[2e7630b38c9e:08991] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08991] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08991] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08991] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08991] [[33963,2],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08991] select: initializing btl component self
[2e7630b38c9e:08991] select: init of component self returned success
[2e7630b38c9e:08991] select: initializing btl component sm
[2e7630b38c9e:08991] select: init of component sm returned failure
[2e7630b38c9e:08991] mca: base: close: component sm closed
[2e7630b38c9e:08991] mca: base: close: unloading component sm
[2e7630b38c9e:08991] select: initializing btl component tcp
[2e7630b38c9e:08991] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08991] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08991] btl: tcp: Using interface: sppp 
[2e7630b38c9e:08991] btl:tcp: 0x5605f1ac9560: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08991] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08991] btl:tcp: Attempting to bind to AF_INET port 1025
[2e7630b38c9e:08991] btl:tcp: Successfully bound to AF_INET port 1025
[2e7630b38c9e:08991] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[2e7630b38c9e:08991] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08991] select: init of component tcp returned success
[2e7630b38c9e:08991] mca: bml: Using self btl for send to [[33963,2],0] on node 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9e
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
rank 0
[2e7630b38c9e:08991] mca: bml: Using tcp btl for send to [[33963,1],0] on node unknown
[2e7630b38c9e:08991] btl: tcp: attempting to connect() to [[33963,1],0] address 172.24.0.2 on port 1024
[2e7630b38c9e:08991] btl:tcp: would block, so allowing background progress
[2e7630b38c9e:08991] btl:tcp: connect() to 172.24.0.2:1024 completed (complete_connect), sending connect ACK
[2e7630b38c9e:08989] btl:tcp: now connected to 172.24.0.2, process [[33963,2],0]
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
[2e7630b38c9e:08989] [[33963,1],0] ompi: Process [[33963,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0]:state_dvm.c(620) updating exit status to 137
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8991 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7f29fc7b681b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7f29fcb0aef7]
[ 2] /home/lab/bin/openmpi/lib/libopen-pal.so.80(mca_btl_tcp_frag_recv+0x148)[0x7f29fc8140e8]
[ 3] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0xb41a3)[0x7f29fc8121a3]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e3a8)[0x7f29fc50b3a8]
[ 5] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f29fc50bb07]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7f29fc782b2f]
[ 7] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f29fc782be5]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv+0x360)[0x7f29fccc6820]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_coll_inter_bcast_inter+0x4e)[0x7f29fcbe3c7e]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Bcast+0x13d)[0x7f29fcb36c8d]
[11] ./test(+0x15c3)[0x5650074635c3]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f29fc887d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f29fc887e40]
[14] ./test(+0x1245)[0x565007463245]
[2e7630b38c9e:08989] [[33963,1],0] ompi_request_is_failed: Request 0x565008264200 (peer 0) is part of a collective (tag -17), and some process died. (mpi_source  -1)
[2e7630b38c9e:08989] Recv_request_cancel: cancel granted for request 0x565008264200 because it has not matched
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
[2e7630b38c9e:08993] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08993] mca: base: components_register: found loaded component self
[2e7630b38c9e:08993] mca: base: components_register: component self register function successful
[2e7630b38c9e:08993] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08993] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08993] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08993] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08993] mca: base: components_open: opening btl components
[2e7630b38c9e:08993] mca: base: components_open: found loaded component self
[2e7630b38c9e:08993] mca: base: components_open: component self open function successful
[2e7630b38c9e:08993] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08993] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08993] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08993] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08993] [[33963,3],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08993] select: initializing btl component self
[2e7630b38c9e:08993] select: init of component self returned success
[2e7630b38c9e:08993] select: initializing btl component sm
[2e7630b38c9e:08993] select: init of component sm returned failure
[2e7630b38c9e:08993] mca: base: close: component sm closed
[2e7630b38c9e:08993] mca: base: close: unloading component sm
[2e7630b38c9e:08993] select: initializing btl component tcp
[2e7630b38c9e:08993] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08993] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08993] btl: tcp: Using interface: sppp 
[2e7630b38c9e:08993] btl:tcp: 0x555e7b27d520: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08993] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08993] btl:tcp: Attempting to bind to AF_INET port 1025
[2e7630b38c9e:08993] btl:tcp: Successfully bound to AF_INET port 1025
[2e7630b38c9e:08993] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[2e7630b38c9e:08993] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08993] select: init of component tcp returned success
[2e7630b38c9e:08993] [[33963,3],0] ompi: Process [[33963,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fe3e0c1081b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fe3e0f64ef7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7fe3e0f651da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7fe3e09652b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fe3e0965b07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fe3e0bdcb2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fe3e0bdcbe5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7fe3e0f83c58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7fe3e0f842f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7fe3e0f76a7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fe3e0fac432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x555e7a6b7361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe3e0ce1d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe3e0ce1e40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x555e7a6b7245]
[2e7630b38c9e:08993] mca: bml: Using self btl for send to [[33963,3],0] on node 2e7630b38c9e
I'm the child. 0 0 2e7630b38c9e
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
rank 0
[2e7630b38c9e:08993] mca: bml: Using tcp btl for send to [[33963,1],0] on node unknown
[2e7630b38c9e:08993] btl: tcp: attempting to connect() to [[33963,1],0] address 172.24.0.2 on port 1024
[2e7630b38c9e:08993] btl:tcp: would block, so allowing background progress
[2e7630b38c9e:08993] btl:tcp: connect() to 172.24.0.2:1024 completed (complete_connect), sending connect ACK
[2e7630b38c9e:08989] btl:tcp: now connected to 172.24.0.2, process [[33963,3],0]
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
[2e7630b38c9e:08989] [[33963,1],0] ompi: Process [[33963,3],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7f29fc7b681b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7f29fcb0aef7]
[ 2] /home/lab/bin/openmpi/lib/libopen-pal.so.80(mca_btl_tcp_frag_recv+0x148)[0x7f29fc8140e8]
[ 3] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0xb41a3)[0x7f29fc8121a3]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e3a8)[0x7f29fc50b3a8]
[ 5] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f29fc50bb07]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7f29fc782b2f]
[ 7] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f29fc782be5]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv+0x360)[0x7f29fccc6820]
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 8993 on node 2e7630b38c9e exited on
signal 9 (Killed).
--------------------------------------------------------------------------
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:08986] [prterun-2e7630b38c9e-8986@0,0] prted_cmd: received add_local_procs
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_coll_inter_bcast_inter+0x4e)[0x7f29fcbe3c7e]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Bcast+0x13d)[0x7f29fcb36c8d]
[11] ./test(+0x15c3)[0x5650074635c3]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f29fc887d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f29fc887e40]
[14] ./test(+0x1245)[0x565007463245]
[2e7630b38c9e:08989] [[33963,1],0] ompi_request_is_failed: Request 0x565008264200 (peer 0) is part of a collective (tag -17), and some process died. (mpi_source  -1)
[2e7630b38c9e:08989] Recv_request_cancel: cancel granted for request 0x565008264200 because it has not matched
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
[2e7630b38c9e:08995] mca: base: components_register: registering framework btl components
[2e7630b38c9e:08995] mca: base: components_register: found loaded component self
[2e7630b38c9e:08995] mca: base: components_register: component self register function successful
[2e7630b38c9e:08995] mca: base: components_register: found loaded component sm
[2e7630b38c9e:08995] mca: base: components_register: component sm register function successful
[2e7630b38c9e:08995] mca: base: components_register: found loaded component tcp
[2e7630b38c9e:08995] mca: base: components_register: component tcp register function successful
[2e7630b38c9e:08995] mca: base: components_open: opening btl components
[2e7630b38c9e:08995] mca: base: components_open: found loaded component self
[2e7630b38c9e:08995] mca: base: components_open: component self open function successful
[2e7630b38c9e:08995] mca: base: components_open: found loaded component sm
[2e7630b38c9e:08995] mca: base: components_open: component sm open function successful
[2e7630b38c9e:08995] mca: base: components_open: found loaded component tcp
[2e7630b38c9e:08995] mca: base: components_open: component tcp open function successful
[2e7630b38c9e:08995] [[33963,4],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[2e7630b38c9e:08995] select: initializing btl component self
[2e7630b38c9e:08995] select: init of component self returned success
[2e7630b38c9e:08995] select: initializing btl component sm
[2e7630b38c9e:08995] select: init of component sm returned failure
[2e7630b38c9e:08995] mca: base: close: component sm closed
[2e7630b38c9e:08995] mca: base: close: unloading component sm
[2e7630b38c9e:08995] select: initializing btl component tcp
[2e7630b38c9e:08995] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[2e7630b38c9e:08995] btl: tcp: Found match: 127.0.0.1 (lo)
[2e7630b38c9e:08995] btl: tcp: Using interface: sppp 
[2e7630b38c9e:08995] btl:tcp: 0x564f26846520: if eth0 kidx 6 cnt 0 addr 172.24.0.2 IPv4 bw 10000 lt 100
[2e7630b38c9e:08995] btl:tcp: Attempting to bind to AF_INET port 1024
[2e7630b38c9e:08995] btl:tcp: Attempting to bind to AF_INET port 1025
[2e7630b38c9e:08995] btl:tcp: Successfully bound to AF_INET port 1025
[2e7630b38c9e:08995] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[2e7630b38c9e:08995] btl: tcp: exchange: 0 6 IPv4 172.24.0.2
[2e7630b38c9e:08995] select: init of component tcp returned success
[2e7630b38c9e:08995] [[33963,4],0] ompi: Process [[33963,3],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fdb24e1981b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fdb2516def7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7fdb2516e1da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7fdb24b6e2b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fdb24b6eb07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fdb24de5b2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fdb24de5be5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7fdb2518cc58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7fdb2518d2f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7fdb2517fa7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fdb251b5432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x564f258b5361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fdb24eead90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fdb24eeae40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x564f258b5245]
[2e7630b38c9e:08995] [[33963,4],0] ompi: Process [[33963,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fdb24e1981b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fdb2516def7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7fdb2516e1da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7fdb24b6e2b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fdb24b6eb07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fdb24de5b2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fdb24de5be5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7fdb2518cc58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7fdb2518d2f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7fdb2517fa7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fdb251b5432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x564f258b5361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fdb24eead90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fdb24eeae40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x564f258b5245]
[2e7630b38c9e:08995] mca: bml: Using self btl for send to [[33963,4],0] on node 2e7630b38c9e
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 2e7630b38c9e
rank 0
I'm the child. 0 0 2e7630b38c9e
........

Bad remote execution:

+ mpiexec -n 1 --host c1de8f727368 --with-ft ulfm --verbose --debug-daemons --mca btl_base_verbose 100 --mca mpi_ft_verbose 100 --map-by node:OVERSUBSCRIBE ./test
Daemon was launched on c1de8f727368 - beginning to initialize
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received add_local_procs
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received add_local_procs
[c1de8f727368:03925] mca: base: components_register: registering framework btl components
[c1de8f727368:03925] mca: base: components_register: found loaded component self
[c1de8f727368:03925] mca: base: components_register: component self register function successful
[c1de8f727368:03925] mca: base: components_register: found loaded component sm
[c1de8f727368:03925] mca: base: components_register: component sm register function successful
[c1de8f727368:03925] mca: base: components_register: found loaded component tcp
[c1de8f727368:03925] mca: base: components_register: component tcp register function successful
[c1de8f727368:03925] mca: base: components_open: opening btl components
[c1de8f727368:03925] mca: base: components_open: found loaded component self
[c1de8f727368:03925] mca: base: components_open: component self open function successful
[c1de8f727368:03925] mca: base: components_open: found loaded component sm
[c1de8f727368:03925] mca: base: components_open: component sm open function successful
[c1de8f727368:03925] mca: base: components_open: found loaded component tcp
[c1de8f727368:03925] mca: base: components_open: component tcp open function successful
[c1de8f727368:03925] [[61898,1],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[c1de8f727368:03925] select: initializing btl component self
[c1de8f727368:03925] select: init of component self returned success
[c1de8f727368:03925] select: initializing btl component sm
[c1de8f727368:03925] select: init of component sm returned failure
[c1de8f727368:03925] mca: base: close: component sm closed
[c1de8f727368:03925] mca: base: close: unloading component sm
[c1de8f727368:03925] select: initializing btl component tcp
[c1de8f727368:03925] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[c1de8f727368:03925] btl: tcp: Found match: 127.0.0.1 (lo)
[c1de8f727368:03925] btl: tcp: Using interface: sppp 
[c1de8f727368:03925] btl:tcp: 0x55e3ea152000: if eth0 kidx 10 cnt 0 addr 172.24.0.4 IPv4 bw 10000 lt 100
[c1de8f727368:03925] btl:tcp: Attempting to bind to AF_INET port 1024
[c1de8f727368:03925] btl:tcp: Successfully bound to AF_INET port 1024
[c1de8f727368:03925] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[c1de8f727368:03925] btl: tcp: exchange: 0 10 IPv4 172.24.0.4
[c1de8f727368:03925] select: init of component tcp returned success
[c1de8f727368:03925] mca: bml: Using self btl for send to [[61898,1],0] on node c1de8f727368
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received add_local_procs
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received add_local_procs
[c1de8f727368:03927] mca: base: components_register: registering framework btl components
[c1de8f727368:03927] mca: base: components_register: found loaded component self
[c1de8f727368:03927] mca: base: components_register: component self register function successful
[c1de8f727368:03927] mca: base: components_register: found loaded component sm
[c1de8f727368:03927] mca: base: components_register: component sm register function successful
[c1de8f727368:03927] mca: base: components_register: found loaded component tcp
[c1de8f727368:03927] mca: base: components_register: component tcp register function successful
[c1de8f727368:03927] mca: base: components_open: opening btl components
[c1de8f727368:03927] mca: base: components_open: found loaded component self
[c1de8f727368:03927] mca: base: components_open: component self open function successful
[c1de8f727368:03927] mca: base: components_open: found loaded component sm
[c1de8f727368:03927] mca: base: components_open: component sm open function successful
[c1de8f727368:03927] mca: base: components_open: found loaded component tcp
[c1de8f727368:03927] mca: base: components_open: component tcp open function successful
[c1de8f727368:03927] [[61898,2],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[c1de8f727368:03927] select: initializing btl component self
[c1de8f727368:03927] select: init of component self returned success
[c1de8f727368:03927] select: initializing btl component sm
[c1de8f727368:03927] select: init of component sm returned failure
[c1de8f727368:03927] mca: base: close: component sm closed
[c1de8f727368:03927] mca: base: close: unloading component sm
[c1de8f727368:03927] select: initializing btl component tcp
[c1de8f727368:03927] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[c1de8f727368:03927] btl: tcp: Found match: 127.0.0.1 (lo)
[c1de8f727368:03927] btl: tcp: Using interface: sppp 
[c1de8f727368:03927] btl:tcp: 0x55cfffe47330: if eth0 kidx 10 cnt 0 addr 172.24.0.4 IPv4 bw 10000 lt 100
[c1de8f727368:03927] btl:tcp: Attempting to bind to AF_INET port 1024
[c1de8f727368:03927] btl:tcp: Attempting to bind to AF_INET port 1025
[c1de8f727368:03927] btl:tcp: Successfully bound to AF_INET port 1025
[c1de8f727368:03927] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[c1de8f727368:03927] btl: tcp: exchange: 0 10 IPv4 172.24.0.4
[c1de8f727368:03927] select: init of component tcp returned success
[c1de8f727368:03927] mca: bml: Using self btl for send to [[61898,2],0] on node c1de8f727368
I'm the child. 0 0 c1de8f727368
MPI_Comm_spawn ret 0: MPI_SUCCESS: no errors
MPI_Comm_spawn errcodes[0] 0: MPI_SUCCESS: no errors
I'm the parent. 0 0 c1de8f727368
rank 0
[c1de8f727368:03927] mca: bml: Using tcp btl for send to [[61898,1],0] on node unknown
[c1de8f727368:03927] btl: tcp: attempting to connect() to [[61898,1],0] address 172.24.0.4 on port 1024
[c1de8f727368:03927] btl:tcp: would block, so allowing background progress
[c1de8f727368:03927] btl:tcp: connect() to 172.24.0.4:1024 completed (complete_connect), sending connect ACK
[c1de8f727368:03925] btl:tcp: now connected to 172.24.0.4, process [[61898,2],0]
Child Bcast Error ret 0: MPI_SUCCESS: no errors
Parent Bcast Error ret 0: MPI_SUCCESS: no errors
[c1de8f727368:03925] [[61898,1],0] ompi: Process [[61898,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0]:state_dvm.c(620) updating exit status to 137
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 3927 on node c1de8f727368 exited on
signal 9 (Killed).
--------------------------------------------------------------------------
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received add_local_procs
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7fc3f773881b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7fc3f7a8cef7]
[ 2] /home/lab/bin/openmpi/lib/libopen-pal.so.80(mca_btl_tcp_frag_recv+0x148)[0x7fc3f77960e8]
[ 3] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0xb41a3)[0x7fc3f77941a3]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e3a8)[0x7fc3f748d3a8]
[ 5] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7fc3f748db07]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7fc3f7704b2f]
[ 7] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7fc3f7704be5]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_pml_ob1_recv+0x360)[0x7fc3f7c48820]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(mca_coll_inter_bcast_inter+0x4e)[0x7fc3f7b65c7e]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Bcast+0x13d)[0x7fc3f7ab8c8d]
[11] ./test(+0x15c3)[0x55e3e83e95c3]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fc3f7809d90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fc3f7809e40]
[14] ./test(+0x1245)[0x55e3e83e9245]
[c1de8f727368:03925] [[61898,1],0] ompi_request_is_failed: Request 0x55e3ea18cf80 (peer 0) is part of a collective (tag -17), and some process died. (mpi_source  -1)
[c1de8f727368:03925] Recv_request_cancel: cancel granted for request 0x55e3ea18cf80 because it has not matched
[c1de8f727368:03925] Rank 00000: DONE WITH FINALIZE
Parent Bcast Error ret 75: MPI_ERR_PROC_FAILED: Process Failure
MPI_Comm_spawn ret 14: MPI_ERR_UNKNOWN: unknown error
MPI_Comm_spawn errcodes[0] 14: MPI_ERR_UNKNOWN: unknown error
I'm the parent. 14 -50 c1de8f727368
rank 0
Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator
Parent Bcast Error ret 5: MPI_ERR_COMM: invalid communicator
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received add_local_procs
[c1de8f727368:03907] PRTE ERROR: Not found in file prted/pmix/pmix_server_dyn.c at line 75
[c1de8f727368:03929] mca: base: components_register: registering framework btl components
[c1de8f727368:03929] mca: base: components_register: found loaded component self
[c1de8f727368:03929] mca: base: components_register: component self register function successful
[c1de8f727368:03929] mca: base: components_register: found loaded component sm
[c1de8f727368:03929] mca: base: components_register: component sm register function successful
[c1de8f727368:03929] mca: base: components_register: found loaded component tcp
[c1de8f727368:03929] mca: base: components_register: component tcp register function successful
[c1de8f727368:03929] mca: base: components_open: opening btl components
[c1de8f727368:03929] mca: base: components_open: found loaded component self
[c1de8f727368:03929] mca: base: components_open: component self open function successful
[c1de8f727368:03929] mca: base: components_open: found loaded component sm
[c1de8f727368:03929] mca: base: components_open: component sm open function successful
[c1de8f727368:03929] mca: base: components_open: found loaded component tcp
[c1de8f727368:03929] mca: base: components_open: component tcp open function successful
[c1de8f727368:03929] [[61898,3],0] ftagree:register) Agreement Algorithm - Early Returning Consensus Algorithm
[c1de8f727368:03925] mca: base: close: component self closed
[c1de8f727368:03925] mca: base: close: unloading component self
[c1de8f727368:03925] mca: base: close: component tcp closed
[c1de8f727368:03925] mca: base: close: unloading component tcp
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_KILL_LOCAL_PROCS
[c1de8f727368:03929] select: initializing btl component self
[c1de8f727368:03929] select: init of component self returned success
[c1de8f727368:03929] select: initializing btl component sm
[c1de8f727368:03929] select: init of component sm returned failure
[c1de8f727368:03929] mca: base: close: component sm closed
[c1de8f727368:03929] mca: base: close: unloading component sm
[c1de8f727368:03929] select: initializing btl component tcp
[c1de8f727368:03929] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[c1de8f727368:03929] btl: tcp: Found match: 127.0.0.1 (lo)
[c1de8f727368:03929] btl: tcp: Using interface: sppp 
[c1de8f727368:03929] btl:tcp: 0x5591b49ac2f0: if eth0 kidx 10 cnt 0 addr 172.24.0.4 IPv4 bw 10000 lt 100
[c1de8f727368:03929] btl:tcp: Attempting to bind to AF_INET port 1024
[c1de8f727368:03929] btl:tcp: Successfully bound to AF_INET port 1024
[c1de8f727368:03929] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[c1de8f727368:03929] btl: tcp: exchange: 0 10 IPv4 172.24.0.4
[c1de8f727368:03929] select: init of component tcp returned success
[c1de8f727368:03929] [[61898,3],0] ompi: Process [[61898,2],0] failed (state = -200 PMIX_ERR_PROC_TERM_WO_SYNC).
[ 0] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_backtrace_print+0x5b)[0x7f25fca5b81b]
[ 1] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_errhandler_proc_failed_internal+0x5d7)[0x7f25fcdafef7]
[ 2] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x7d1da)[0x7f25fcdb01da]
[ 3] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(+0x1e2b8)[0x7f25fc7b02b8]
[ 4] /home/lab/bin/openmpi/lib/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f25fc7b0b07]
[ 5] /home/lab/bin/openmpi/lib/libopen-pal.so.80(+0x24b2f)[0x7f25fca27b2f]
[ 6] /home/lab/bin/openmpi/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f25fca27be5]
[ 7] /home/lab/bin/openmpi/lib/libmpi.so.40(+0x9bc58)[0x7f25fcdcec58]
[ 8] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_instance_init+0x68)[0x7f25fcdcf2f8]
[ 9] /home/lab/bin/openmpi/lib/libmpi.so.40(ompi_mpi_init+0xaf)[0x7f25fcdc1a7f]
[10] /home/lab/bin/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7f25fcdf7432]
[11] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1361)[0x5591b41f3361]
[12] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f25fcb2cd90]
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f25fcb2ce40]
[14] /work/xpn/test/integrity/mpi_connect_accept/test(+0x1245)[0x5591b41f3245]
[c1de8f727368:03929] mca: bml: Using self btl for send to [[61898,3],0] on node c1de8f727368
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_EXIT_CMD
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: received exit cmd
[2e7630b38c9e:09052] [prterun-2e7630b38c9e-9052@0,0] prted_cmd: exit cmd, 1 routes still exist
[c1de8f727368:03907] PRTE ERROR: Not found in file prted/pmix/pmix_server_dyn.c at line 75
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_EXIT_CMD
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: received exit cmd
[c1de8f727368:03907] [prterun-2e7630b38c9e-9052@0,1] prted_cmd: all routes and children gone - exiting

dariomnz · Jun 04 '24 12:06