Using malloc_shared with MPI_File_write_at_all on Intel GPUs

Open colleeneb opened this issue 8 months ago • 8 comments

Hello,

We are reporting an issue we see with MPICH on Intel GPUs (related to an IOR issue from @pkcoff). A small reproducer is below. The code passes a buffer allocated with SYCL's malloc_shared to MPI_File_write_at_all. The same code works with regular malloc, and it works with malloc_shared on a single node, but on 2 nodes it crashes with the error "Abort(15) on node 1 (rank 1 in comm 496): Fatal error in internal_Issend: Other MPI error". Is it expected that memory allocated with SYCL's malloc_shared cannot be passed as a buffer to MPI I/O functions such as MPI_File_write_at_all in multi-node jobs?

Reproducer

> cat t.cpp
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <sycl/sycl.hpp>

int main(){
    MPI_Init(NULL, NULL);

    sycl::queue syclQ{sycl::gpu_selector_v };

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int numProcs;
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    MPI_File outFile;
    MPI_File_open(
        MPI_COMM_WORLD, "test", MPI_MODE_CREATE | MPI_MODE_WRONLY,
        MPI_INFO_NULL, &outFile);

    // regular malloc like below works, malloc_shared fails   
    //    char *bufToWrite = (char*)malloc(sizeof(char)*4);  
    char *bufToWrite = (char*)sycl::malloc_shared<char>(4, syclQ);
    snprintf(bufToWrite, 4, "%3d", rank);
    printf("%s\n", bufToWrite);
    MPI_File_write_at_all(
                          outFile, rank * 3,
                          bufToWrite, 3, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&outFile);
    MPI_Finalize();
}
> mpicc -fsycl t.cpp
# run on two nodes, one rank per node
> mpirun -n 2 -ppn 1 ./a.out 

Expected output

It should run like:

> mpirun -n 2 -ppn 1 ./a.out
  1
  0

We expect it to run, since memory allocated with malloc_shared is accessible on the host. The same binary also works fine with 2 MPI ranks on 1 node.

Actual output

> mpirun -n 2 -ppn 1 ./a.out
  1
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
cxil_map: write error
Abort(15) on node 1 (rank 1 in comm 496): Fatal error in internal_Issend: Other MPI error
  0
x1921c6s1b0n0.hostmgmt2000.cm.americas.sgi.com: rank 1 exited with code 15

Note that the run above used the default ZE_FLAT_DEVICE_HIERARCHY=FLAT. With ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE it also fails, with a different error:

>  mpirun -n 2 -ppn 1 ./a.out
free(): invalid pointer
x1921c6s1b0n0.hostmgmt2000.cm.americas.sgi.com: rank 1 died from signal 6
x1921c5s5b0n0.hostmgmt2000.cm.americas.sgi.com: rank 0 died from signal 15

colleeneb, Jun 19 '24 16:06