mpich
Hydra hangs waiting for children forked by MPI ranks (instead of only waiting for its own children)
Hydra keeps waiting for detached forked processes even after all MPI ranks have exited (and have thus become zombies). The following code reproduces the issue with MPICH 3.3.2 when run with more than one rank.
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int rank;
    pid_t parent_id, child_id;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        parent_id = getpid();
        // fork a helper process that detaches itself below
        child_id = fork();
        if (child_id < 0) {
            perror("parent can't fork");
            return -1;
        }
        if (child_id == 0) {
            // detaching child from parent
            if (setsid() < 0 || chdir("/") < 0) {
                perror("child can't set new session and/or chdir to root");
                return -2;
            }
            // closing all inputs and outputs
            fclose(stdin);
            fclose(stdout);
            fclose(stderr);
            sleep(5);
            // signal parent to continue after init
            kill(parent_id, SIGCONT);
            // do extra work (e.g., a cleanup)
            sleep(10);
            return 0;
        } else {
            printf("Waiting for child to finish init\n");
            // stdout is usually a pipe under Hydra (fully buffered), so flush before stopping
            fflush(stdout);
            // stop this process until the child sends SIGCONT
            kill(parent_id, SIGSTOP);
            printf("Child init complete\n");
        }
    }
    MPI_Finalize();
    printf("Rank %d: exiting now - Hydra still waiting for child\n", rank);
    return 0;
}
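A plausible mechanism, offered as an assumption rather than something confirmed from Hydra's source: the proxy waits until the descriptors it shares with each rank (for example stdio pipes or the PMI connection) report end-of-file, and a pipe reports end-of-file only after every copy of its write end is closed, including copies inherited by a forked child. The stand-alone sketch below uses no MPI; the caller plays the process manager, a child plays the rank, and a grandchild plays the detached helper. The function name `reader_sees_eof` is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo, not Hydra code. Returns 1 if the reader saw
 * end-of-file within timeout_ms, 0 if it timed out, -1 on error. */
static int reader_sees_eof(int grandchild_closes_pipe, int timeout_ms)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t rank = fork();
    if (rank < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (rank == 0) {
        /* "MPI rank": keeps only the write end, forks a helper, exits. */
        close(fds[0]);
        pid_t helper = fork();
        if (helper == 0) {
            if (grandchild_closes_pipe)
                close(fds[1]); /* drop the inherited write end */
            sleep(1);          /* outlive the rank, like the reproducer's child */
            _exit(0);
        }
        _exit(helper < 0 ? 1 : 0);
    }

    /* "Process manager": drop its own write end and reap the rank. */
    close(fds[1]);
    waitpid(rank, NULL, 0);

    /* End-of-file arrives only once every copy of the write end is closed. */
    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    int result = 0;
    if (ready > 0) {
        char c;
        result = (read(fds[0], &c, 1) == 0) ? 1 : 0;
    } else if (ready < 0) {
        result = -1;
    }
    close(fds[0]);
    return result;
}
```

With a 500 ms timeout, `reader_sees_eof(1, 500)` would be expected to return 1, while `reader_sees_eof(0, 500)` returns 0 because the grandchild keeps the write end open for about a second after the "rank" has already been reaped.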
This is insufficient:
// closing all inputs and outputs
fclose(stdin);
fclose(stdout);
fclose(stderr);
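To check which descriptors the child still holds after those three `fclose` calls, one option on Linux is to list `/proc/self/fd`; pipes and sockets show up there as `pipe:[inode]` and `socket:[inode]`. This is a Linux-only diagnostic sketch, and the helper name `list_open_fds` is hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Linux-only sketch: print each descriptor this process holds and what it
 * points to, by reading /proc/self/fd. Returns the number of descriptors
 * found (not counting the one opendir() uses), or -1 if /proc is missing. */
static int list_open_fds(FILE *out)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;
    int dir_fd = dirfd(dir);
    int count = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue; /* skip "." and ".." */
        int fd = atoi(ent->d_name);
        if (fd == dir_fd)
            continue; /* the descriptor used to read this directory */
        char path[64];
        char target[4096];
        snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
        ssize_t n = readlink(path, target, sizeof target - 1);
        target[n < 0 ? 0 : n] = '\0';
        if (out != NULL)
            fprintf(out, "fd %d -> %s\n", fd, target);
        count++;
    }
    closedir(dir);
    return count;
}
```

Since stdout and stderr are already closed at that point in the child, the output would have to go to a file opened with `fopen`; that file's own descriptor will then appear in the listing as well.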
There are more I/O descriptors open between the MPI process and the process manager. I am not sure what the best practice is, but something like
// closing all inputs and outputs
for (int i = 0; i < 256; i++) close(i);
should truly detach the forked child.
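A slightly more defensive variant of that loop, as a sketch under stated assumptions: it takes the upper bound from `getrlimit(RLIMIT_NOFILE)` instead of a fixed 256 (capped at an arbitrary 65536 so the loop stays cheap when the limit is huge), and it reopens descriptors 0, 1 and 2 on `/dev/null` so that later stray reads or writes in the child neither fail nor land on an unrelated file that reuses those numbers. The helper names are hypothetical. On Linux 5.9 and later, `close_range(2)` can replace the loop with a single call.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

#define FD_SCAN_CAP 65536L /* arbitrary upper bound, an assumption */

/* Close every descriptor >= lowfd, up to RLIMIT_NOFILE (or the cap). */
static void close_fds_from(int lowfd)
{
    long maxfd = 1024; /* fallback if the limit is unknown or unlimited */
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur != RLIM_INFINITY)
        maxfd = (long)rl.rlim_cur;
    if (maxfd > FD_SCAN_CAP)
        maxfd = FD_SCAN_CAP;
    for (long fd = lowfd; fd < maxfd; fd++)
        close((int)fd); /* EBADF for unused numbers is expected and ignored */
}

/* Point descriptors 0, 1 and 2 at /dev/null. Returns 0 on success, -1 on error. */
static int redirect_stdio_to_devnull(void)
{
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0)
        return -1;
    int rc = 0;
    for (int target = 0; target <= 2; target++) {
        if (fd != target && dup2(fd, target) < 0)
            rc = -1;
    }
    if (fd > 2)
        close(fd);
    return rc;
}

/* Hypothetical helper for the forked child: drop every descriptor inherited
 * from the MPI rank, including any pipes or sockets to the process manager. */
static int detach_child_fds(void)
{
    close_fds_from(3);
    return redirect_stdio_to_devnull();
}
```

In the reproducer's child branch, the three `fclose` calls would become something like `if (detach_child_fds() < 0) _exit(2);`, placed after `setsid()` and `chdir("/")`. Because stdout and stderr then refer to `/dev/null`, the stdio `FILE` objects stay valid instead of dangling.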
Closing the issue due to staleness. @bnicolae, if the issue is still relevant, please re-open it.