ulfm: support recv on failed process

Open hzhou opened this issue 3 years ago • 2 comments

Pull Request Description

A key ingredient of the ULFM proposal is that MPI_Recv should return MPI_ERR_PROC_FAILED when the process it is trying to receive from has failed. In ch3, this is natively supported in the device-layer progress loop. In ch4, because progress on individual operations has been delegated to the lower-level library, the ch4-layer progress can't check and fail the receive. To support ULFM generally, we can check the request against the failed-process list at the MPIR-wait layer, where the actual request is available. We only do this check when we have not made progress for a while and a failed process has been detected.
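
To illustrate the idea, here is a minimal, self-contained sketch of a wait loop that only consults the failed-process list after it has been idle for a while. It is not the MPICH implementation: every name in it (`sketch_wait`, `sketch_request_t`, `sketch_failed_set_t`, `SKETCH_STALL_THRESHOLD`, the error codes, and the two progress callbacks) is hypothetical, and the threshold value is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not MPICH symbols. */
#define SKETCH_SUCCESS          0
#define SKETCH_ERR_PROC_FAILED  1
#define SKETCH_STALL_THRESHOLD  100     /* assumed number of idle polls before checking */
#define SKETCH_MAX_PROCS        64

/* Stand-in for the failed-process list. */
typedef struct {
    bool failed[SKETCH_MAX_PROCS];
    int num_failed;
} sketch_failed_set_t;

/* Stand-in for a pending receive request. */
typedef struct {
    int source_rank;
    bool complete;
} sketch_request_t;

/* A progress callback returns true if it made progress on the request. */
typedef bool (*sketch_progress_fn) (sketch_request_t * req, void *ctx);

/* Callback that never makes progress (models a sender that is gone). */
bool sketch_progress_never(sketch_request_t * req, void *ctx)
{
    (void) req;
    (void) ctx;
    return false;
}

/* Callback that completes the request after *ctx calls. */
bool sketch_progress_countdown(sketch_request_t * req, void *ctx)
{
    int *remaining = ctx;
    if (--(*remaining) > 0)
        return false;
    req->complete = true;
    return true;
}

/* Wait for req; fail it only if we are stalled AND its source is known to have failed. */
int sketch_wait(sketch_request_t * req, const sketch_failed_set_t * fs,
                sketch_progress_fn progress, void *ctx)
{
    assert(req != NULL && fs != NULL && progress != NULL);
    int idle = 0;
    while (!req->complete) {
        if (progress(req, ctx)) {
            idle = 0;           /* quick progress: skip the failure check entirely */
            continue;
        }
        if (++idle < SKETCH_STALL_THRESHOLD)
            continue;
        idle = 0;
        /* Stalled: check only when at least one failure has been detected. */
        if (fs->num_failed > 0 && req->source_rank >= 0 &&
            req->source_rank < SKETCH_MAX_PROCS && fs->failed[req->source_rank])
            return SKETCH_ERR_PROC_FAILED;
    }
    return SKETCH_SUCCESS;
}
```

In this sketch a request whose source has failed still completes successfully if the data arrives before the loop stalls, and a request from a live but silent source keeps waiting, as a blocking receive would.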

Process failures are signaled (SIGUSR1) by a supported process manager (e.g. hydra), and the failed-process list is updated during the ch4 progress loop.
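
As a rough sketch of that notification path, the handler below only records that SIGUSR1 arrived, and the actual list update is deferred to a hook called from the progress loop. This is an illustration under assumptions, not MPICH code: all `sketch_*` names are hypothetical, and the refresh step is a stub that just counts calls, since how the real list is obtained from the process manager is not covered here.

```c
#include <signal.h>
#include <stdbool.h>

/* All sketch_* names are hypothetical, for illustration only. */

/* Set from the signal handler, cleared from the progress loop. */
static volatile sig_atomic_t sketch_failure_pending = 0;

/* Counts how many times the (stubbed) failed-process list was refreshed. */
static int sketch_refresh_count = 0;

/* Handler: do the minimum that is async-signal-safe, i.e. set a flag. */
static void sketch_on_sigusr1(int signo)
{
    (void) signo;
    sketch_failure_pending = 1;
}

/* Install the handler. Standard signal() keeps the sketch short;
 * real code would more likely use sigaction(). */
bool sketch_install_handler(void)
{
    return signal(SIGUSR1, sketch_on_sigusr1) != SIG_ERR;
}

/* Stub: a real implementation would rebuild the failed-process list here. */
static void sketch_refresh_failed_procs(void)
{
    sketch_refresh_count++;
}

/* Called from the progress loop; returns true if the list was refreshed. */
bool sketch_progress_check_failures(void)
{
    if (!sketch_failure_pending)
        return false;
    sketch_failure_pending = 0;
    sketch_refresh_failed_procs();
    return true;
}
```

Keeping the handler down to a flag write means the potentially expensive list update runs in normal context, and the progress loop pays only one flag read per iteration when nothing has failed.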

NOTES

  • We are enabling this without checking MPIR_CVAR_ENABLE_FT. First, we only do this failure check when there is a failed process, so there is no negative effect when there are no process failures (e.g. benchmarking). Second, we only do this when the progress loop is not already making quick progress.

Author Checklist

  • [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • [x] Commits Follow Good Practice Commits are self-contained and do not do two things at once. Commit message is of the form: module: short description Commit message explains what's in the commit.
  • [x] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
  • [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

Apr 05 '22 03:04 hzhou

Not quite reliable. While it often works, sometimes I get

~/work/pull_requests/2204_get_failed/test/mpi/ft$ mpirun -disable-auto-cleanup -l -n 2 ./recvdead
[0] MPIR_update_failed_procs: {1}
[0]  No Errors
~/work/pull_requests/2204_get_failed/test/mpi/ft$ mpirun -disable-auto-cleanup -l -n 2 ./recvdead
[0] MPIR_update_failed_procs: {1}
[0]  No Errors
~/work/pull_requests/2204_get_failed/test/mpi/ft$ mpirun -disable-auto-cleanup -l -n 2 ./recvdead
[0] MPIR_update_failed_procs: {1}
[0]  No Errors
~/work/pull_requests/2204_get_failed/test/mpi/ft$ mpirun -disable-auto-cleanup -l -n 2 ./recvdead
[0] MPIR_update_failed_procs: {1}
[0]  No Errors
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Apr 05 '22 03:04 hzhou

test:mpich/ch3/most test:mpich/ch4/most

Apr 07 '22 13:04 hzhou