Hydra hangs waiting for children forked by MPI ranks (instead of only waiting for its own children)

Open bnicolae opened this issue 5 years ago • 1 comments

Hydra keeps waiting for detached forked processes even after all MPI ranks have exited (and thus become zombies). The following code reproduces the issue on MPICH 3.3.2 when run with more than one rank.

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    pid_t parent_id, child_id;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        parent_id = getpid();
        child_id = fork();
        // fork failed: report the error and bail out
        if (child_id < 0) {
            perror("parent can't fork");
            return -1;
        }
        if (child_id == 0) {
            // detaching child from parent
            if (setsid() < 0 || chdir("/") < 0) {
                perror("child can't set new session and/or chdir to root");
                return -2;
            }
            // closing all inputs and outputs
            fclose(stdin); 
            fclose(stdout);
            fclose(stderr);
            sleep(5);
            // signal parent to continue after init
            kill(parent_id, SIGCONT);
            // do extra work (e.g., a cleanup)
            sleep(10);
            return 0;
        } else {
            printf("Waiting for child to finish init\n");
            kill(parent_id, SIGSTOP);
            printf("Child init complete\n");
        }
    }
    MPI_Finalize();
    printf("Rank %d: exiting now - Hydra still waiting for child\n", rank);
    return 0;
}

bnicolae avatar Nov 24 '20 18:11 bnicolae

Closing only the three standard streams is insufficient:

            // closing all inputs and outputs
            fclose(stdin); 
            fclose(stdout);
            fclose(stderr);

There are more I/O descriptors open between the MPI process and the process manager. I am not sure what the best practice is, but something like

            // closing all inputs and outputs
            for (int i = 0; i < 256; i++) close(i); 

should truly detach the forked child.
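
A slightly more general version of the loop above might look like the sketch below. It is only an illustration, not something taken from MPICH or Hydra: the helper name `close_fds_from` and the `FD_SCAN_CAP` limit are hypothetical. It assumes the descriptor limit reported by `sysconf(_SC_OPEN_MAX)` is a reasonable upper bound, and it caps the scan because that limit can be very large on some systems, so descriptors above the cap would stay open.

```c
#include <unistd.h>

/* Hypothetical cap on how many descriptors to scan (an assumption for
 * this sketch); descriptors at or above this value are left untouched. */
#define FD_SCAN_CAP 65536L

/* Close every descriptor in [lowfd, bound), where bound is the process
 * descriptor limit clamped to FD_SCAN_CAP. Returns the bound used. */
long close_fds_from(int lowfd) {
    long bound = sysconf(_SC_OPEN_MAX);
    if (bound < 0)
        bound = 256;            /* limit unknown: fall back to the value from the comment above */
    if (bound > FD_SCAN_CAP)
        bound = FD_SCAN_CAP;    /* avoid an extremely long loop */
    for (long fd = lowfd; fd < bound; fd++)
        close((int)fd);         /* failures (EBADF) on unopened descriptors are ignored */
    return bound;
}
```

In the reproducer, the forked child would call `close_fds_from(0)` after `setsid()` in place of the three `fclose` calls, so that it no longer holds any descriptor inherited from the MPI rank.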

hzhou avatar Aug 10 '22 22:08 hzhou

Closing the issue due to staleness. @bnicolae, if the issue is still relevant, please re-open it.

hzhou avatar Oct 12 '22 03:10 hzhou