mpich icon indicating copy to clipboard operation
mpich copied to clipboard

gforker: propagation of exit status

Open dalcinl opened this issue 1 year ago • 5 comments

A few times a month, mpi4py's testsuite running daily in GitHub Actions fails the following way:

======================================================================
FAIL: testInterrupt (__main__.TestRunScript.testInterrupt)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/mpi4py-testing/mpi4py-testing/demo/test-run/test_run.py", line 173, in testInterrupt
    self.assertEqual(status, SIGINT + 128)
AssertionError: 221 != 130

Full logs here

This is the test code is here. What this test does is to execute some Python code with np=1,2 , and then one of the ranks (but not both in the np=2 case) signals itself with SIGINT. Upon an uncaught SIGINT, Python's signal/error handling prints to stderr and exits cleanly with return status=SIGINT+128 (that it, 130 on Linux). My test is indeed checking that exit statuses are propagated properly from the MPI process through mpiexec and back to the user. Very sporadically, on GitHub Actions, that is not the case for gforker.

I believe there must be either some sort of race condition, or perhaps a two short waiting time, that originates this non-reproducible issue. I looked at gforker's sources, but I'm afraid I do not understand it well enough as to guess where the problem may be. Any tips on how to further debug the issue?

dalcinl avatar Aug 26 '24 06:08 dalcinl

Fantastic! Now we have phishing in GitHub comments? :facepalm:

dalcinl avatar Aug 26 '24 06:08 dalcinl

The code that aggregates exit code is here : https://github.com/pmodels/mpich/blob/4f8e4a968012d6e59e22b606a76f613d61a7a91c/src/pm/util/process.c#L207-L209

If you can reproduce it locally, I would try add debug printf there to see if anything is usual.

PS: after that, I would try check here: https://github.com/pmodels/mpich/blob/4f8e4a968012d6e59e22b606a76f613d61a7a91c/src/pm/util/process.c#L407-L427

hzhou avatar Aug 26 '24 14:08 hzhou

Here you have a reproducer that at least show different behavior between hydra and gforker, and their interaction with the child process in case of SIGNIT vs no signal.

code='raise KeyboardInterrupt'
#code='raise RuntimeError'
#code='import sys; sys.exit(42)'
python -c "$code" >/dev/null 2>&1 
echo $?
mpiexec.hydra -n 1 python -c "$code" >/dev/null 2>&1 
echo $?
mpiexec.gforker -n 1 python -c "$code" >/dev/null 2>&1 
echo $?

The code raise KeyboardInterrupt behaves in a special way, see the end of this comment.

If you run the exact lines above in bash, you should get:

130
2
0

If you uncomment #code='raise RuntimeError', then you should get

1
1
1

If you uncomment #code='import sys; ...', then you should get

42
42
42

I'll detail an important Python-specific piece of information: Upon an unhandled KeyboardInterupt exception, Python basically does the following: https://github.com/python/cpython/blob/main/Modules/main.c#L732

setsig(SIGINT, SIG_DFL);
kill(getpid(), SIGINT);
return SIGINT + 128;

dalcinl avatar Aug 27 '24 07:08 dalcinl

I see. Thanks for the simple reproducer! I can make gforker to return the same as hydra. Essentially, in https://github.com/pmodels/mpich/blob/4f8e4a968012d6e59e22b606a76f613d61a7a91c/src/pm/util/process.c#L407-L427 Line 414, we should just add rc = prog_stat;.

hzhou avatar Aug 28 '24 21:08 hzhou

Indeed, the following patch makes hydra and gforker behave identically with my reproducer. I think it would be a good idea to push these changes to make the two PMs behave consistently.

diff --git a/src/pm/util/process.c b/src/pm/util/process.c
index b5b1ad2d1e..c0757144a2 100644
--- a/src/pm/util/process.c
+++ b/src/pm/util/process.c
@@ -411,6 +411,7 @@ void MPIE_ProcessSetExitStatus(ProcessState * pState, int prog_stat)
         rc = WEXITSTATUS(prog_stat);
     }
     if (WIFSIGNALED(prog_stat)) {
+        rc = prog_stat;
         sigval = WTERMSIG(prog_stat);
     }
     pState->exitStatus.exitStatus = rc;

However, after further consideration, I do not think this is the root of problem in my CI failures. When executing mpiexec through Python's subprocess module, it seems things are handled different. Otherwise, hydra and gforker should behave differently, and that is not the case (most of the time), and I do not have an explanation for it. I'll keep investigating.

dalcinl avatar Aug 30 '24 11:08 dalcinl

At this point I believe these weird 221 return codes I'm getting in GitHub Actions are something specific to such virtualized environment.

dalcinl avatar Oct 25 '24 13:10 dalcinl