
Regression: Spawned processes are not killed on timeout

Open • AntonDaumen opened this issue 1 month ago • 1 comment

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

5.0.8

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

907b1ccaeec61a1197f0ee5264d4fef20b257b84 3rd-party/openpmix (v5.0.8)
222f03fbb98b71abd293aa205b38fa9a38e57965 3rd-party/prrte (v3.0.11)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (heads/main)

Please describe the system on which you are running

  • Operating system/version: RHEL 9.4 (Linux 5.14.0-427.42.1.el9_4.aarch64)
  • Computer hardware: ARM Neoverse-N1
  • Network type: no network used to reproduce

Details of the problem

First of all, sorry if this report belongs in the PRRTE GitHub issues; I wasn't sure and decided to open it here first. I'll open it there if that is more appropriate.

With Open MPI 5, when an MPI application with spawned processes hits a timeout, the spawned processes don't seem to be killed and the application doesn't stop. Most of the time the application is finally killed after exactly 1 hour, although I have seen cases where it seemed that the application was never killed.

This seems to be a regression, as I have never been able to reproduce it with any Open MPI 4 version.

I am using this simple test to reproduce the issue: spawn_timeout_reprod.c

Compiled with: mpicc spawn_timeout_reprod.c -o spawn_timeout_reprod

Launched with: time mpirun --tag-output --report-state-on-timeout --timeout 5 --np 1 ./spawn_timeout_reprod

Below you will find both the Open MPI 4 and the Open MPI 5 output of this same test. Note the difference in the output of the --tag-output and --report-state-on-timeout options (although this is much less problematic): a lot of information about the spawned processes seems to be lost with Open MPI 5.

With Open MPI 4 the test is killed after around 6 seconds, so the timeout is effective. With Open MPI 5 the test only ends after about 15 seconds, once the sleep finishes, so the timeout is ineffective.
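For anyone who wants to turn the timing comparison above into an automated regression check, here is a minimal sketch. It is not part of the original reproducer: the helper names `run_with_deadline` and `timeout_honored`, and the 3-second cleanup allowance, are assumptions chosen for illustration. The watchdog kills the launcher's process group if it overruns, which also keeps a hung run from lingering for an hour; processes that have moved into a different process group would not be reached by this.

```python
import os
import signal
import subprocess
import time


def run_with_deadline(cmd, hard_limit_s):
    """Run cmd in a new session and SIGKILL its process group after hard_limit_s.

    Returns (elapsed_seconds, killed_by_watchdog).
    """
    start = time.monotonic()
    # start_new_session=True makes the child a session and process-group leader,
    # so its PID doubles as the process-group ID used by os.killpg below.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        proc.wait(timeout=hard_limit_s)
        killed = False
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group vanished between the timeout and the kill
        proc.wait()  # reap the launcher so no zombie is left behind
        killed = True
    return time.monotonic() - start, killed


def timeout_honored(elapsed_s, timeout_s, grace_s=3.0):
    """True if the run ended within timeout_s plus an assumed cleanup allowance."""
    return elapsed_s <= timeout_s + grace_s


# Wall-clock times taken from the two runs in this report (--timeout 5).
print(timeout_honored(6.070, 5))   # Open MPI 4 run -> True
print(timeout_honored(15.214, 5))  # Open MPI 5 run -> False
```

To apply it to the real reproducer, one could call `run_with_deadline(["mpirun", "--timeout", "5", "--np", "1", "./spawn_timeout_reprod"], hard_limit_s=10.0)` and pass the returned elapsed time to `timeout_honored`.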

Open MPI 4 output

~/Workdir $ ompi_info | grep Ident
            Ident string: 4.1.8rc1

~/Workdir $ time mpirun --tag-output --report-state-on-timeout --timeout 5 --np 1 ./spawn_timeout_reprod
[1,0]<stdout>:Spawning 1 processes
[1,0]<stdout>:Sleeping for 15 seconds
[2,0]<stdout>:Got spawned
[2,0]<stdout>:Sleeping for 15 seconds
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 5 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
DATA FOR JOB: [51378,0]
    Num apps: 1 Num procs: 1    JobState: ALL DAEMONS REPORTED  Abort: False
    Num launched: 0 Num reported: 1 Num terminated: 0

    Procs:
        Rank: 0 Node: login1    PID: 2596154    State: RUNNING  ExitCode 0

DATA FOR JOB: [51378,1]
    Num apps: 1 Num procs: 1    JobState: SYNC REGISTERED   Abort: False
    Num launched: 1 Num reported: 1 Num terminated: 0

    Procs:
        Rank: 0 Node: login1    PID: 2596157    State: SYNC REGISTERED  ExitCode 0

DATA FOR JOB: [51378,2]
    Num apps: 1 Num procs: 1    JobState: SYNC REGISTERED   Abort: False
    Num launched: 1 Num reported: 1 Num terminated: 0

    Procs:
        Rank: 0 Node: login1    PID: 2596160    State: SYNC REGISTERED  ExitCode 0

real    0m6.070s
user    0m0.025s
sys     0m0.037s

Open MPI 5 output

~/Workdir $ ompi_info | grep Ident
            Ident string: 5.0.8rc3

~/Workdir $ time mpirun --tag-output --report-state-on-timeout --timeout 5 --np 1 ./spawn_timeout_reprod
[1,0]<stdout>: Spawning 1 processes
Got spawned
Sleeping for 15 seconds
[1,0]<stdout>: Sleeping for 15 seconds
[1,WILDCARD]<stderr>: --------------------------------------------------------------------------
[1,WILDCARD]<stderr>: The user-provided time limit for job execution has been reached:
[1,WILDCARD]<stderr>:
[1,WILDCARD]<stderr>:   Timeout: 5 seconds
[1,WILDCARD]<stderr>:
[1,WILDCARD]<stderr>: The job will now be aborted.  Please check your code and/or
[1,WILDCARD]<stderr>: adjust/remove the job execution time limit (as specified by --timeout
[1,WILDCARD]<stderr>: command line option or MPIEXEC_TIMEOUT environment variable).
[1,WILDCARD]<stderr>: --------------------------------------------------------------------------
[1,WILDCARD]<stderr>: DATA FOR JOB: prterun-login1-2597499@1
[1,WILDCARD]<stderr>:   Num apps: 1 Num procs: 1    JobState: SYNC REGISTERED   Abort: False
[1,WILDCARD]<stderr>:   Num launched: 1 Num reported: 1 Num terminated: 0
[1,WILDCARD]<stderr>:
[1,WILDCARD]<stderr>:   Procs:
[1,WILDCARD]<stderr>:       Rank: 0 Node: login1    PID: 2597502    State: SYNC REGISTERED  ExitCode 0
[1,WILDCARD]<stderr>:

real    0m15.214s
user    0m0.047s
sys     0m0.023s
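
The difference in the --report-state-on-timeout output can also be checked mechanically. The sketch below is only an illustration (the function name `reported_jobs` is made up here): it extracts the job identifiers from the `DATA FOR JOB:` header lines, using excerpts copied from the two outputs above.

```python
import re

# Matches the job header printed by --report-state-on-timeout in both outputs,
# with or without the "[1,WILDCARD]<stderr>: " prefix added by --tag-output.
JOB_HEADER = re.compile(r"DATA FOR JOB:\s*(\S+)")


def reported_jobs(output):
    """Return the job identifiers listed in a state-on-timeout report."""
    return JOB_HEADER.findall(output)


# Header lines copied from the Open MPI 4 and Open MPI 5 runs above.
ompi4_excerpt = """\
DATA FOR JOB: [51378,0]
DATA FOR JOB: [51378,1]
DATA FOR JOB: [51378,2]
"""
ompi5_excerpt = """\
[1,WILDCARD]<stderr>: DATA FOR JOB: prterun-login1-2597499@1
"""

print(reported_jobs(ompi4_excerpt))  # ['[51378,0]', '[51378,1]', '[51378,2]']
print(reported_jobs(ompi5_excerpt))  # ['prterun-login1-2597499@1']
```

Open MPI 4 reports three job entries, while Open MPI 5 reports a single one and no separate entry for the spawned job.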

AntonDaumen, Oct 20 '25 14:10