MPI.jl
UCX incompatible with CUDA.jl memory pool
I am not sure whether this is an MPI.jl issue or something specific to our local supercomputer, but I have a failing Alltoall
in my Julia code, whereas the identical code in C++ works, which suggests that the problem does not lie in our MPI or CUDA install. I do not really know how to proceed from here. I got excellent help in making sure that the libraries are set up correctly at https://discourse.julialang.org/t/cuda-aware-mpi-works-on-system-but-not-for-julia/75060, but the problem remains. The error is:
[1642966632.503811] [gcn21:3820255:0] gdr_copy_md.c:122 UCX ERROR gdr_pin_buffer failed. length :65536 ret:22
signal (11): Segmentation fault
in expression starting at /gpfs/scratch1/shared/chiel/MicroHH.jl/test/alltoall_test.jl:32
[1642966632.504314] [gcn21:3820254:0] gdr_copy_md.c:122 UCX ERROR gdr_pin_buffer failed. length :65536 ret:22
signal (11): Segmentation fault
in expression starting at /gpfs/scratch1/shared/chiel/MicroHH.jl/test/alltoall_test.jl:32
uct_gdr_copy_mkey_pack at /tmp/jenkins/build/UCXCUDA/1.10.0/GCCcore-10.3.0-CUDA-11.3.1/ucx-1.10.0/src/uct/cuda/gdr_copy/gdr_copy_md.c:68
uct_gdr_copy_mkey_pack at /tmp/jenkins/build/UCXCUDA/1.10.0/GCCcore-10.3.0-CUDA-11.3.1/ucx-1.10.0/src/uct/cuda/gdr_copy/gdr_copy_md.c:68
ucp_mem_type_pack at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/dt/dt.c:87
ucp_dt_pack at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/dt/dt.c:123
ucp_mem_type_pack at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/dt/dt.c:87
ucp_dt_pack at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/dt/dt.c:123
ucp_tag_pack_eager_common at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:31 [inlined]
ucp_tag_pack_eager_only_dt at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:44
ucp_tag_pack_eager_common at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:31 [inlined]
ucp_tag_pack_eager_only_dt at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:44
uct_mm_ep_am_common_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/sm/mm/base/mm_ep.c:292 [inlined]
uct_mm_ep_am_bcopy at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/sm/mm/base/mm_ep.c:353
uct_ep_am_bcopy at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/api/uct.h:2650 [inlined]
ucp_do_am_bcopy_single at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/proto/proto_am.inl:37 [inlined]
ucp_tag_eager_bcopy_single at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:133
uct_mm_ep_am_common_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/sm/mm/base/mm_ep.c:292 [inlined]
uct_mm_ep_am_bcopy at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/sm/mm/base/mm_ep.c:353
uct_ep_am_bcopy at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/uct/api/uct.h:2650 [inlined]
ucp_do_am_bcopy_single at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/proto/proto_am.inl:37 [inlined]
ucp_tag_eager_bcopy_single at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/eager_snd.c:133
ucp_request_try_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/core/ucp_request.inl:242 [inlined]
ucp_request_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/core/ucp_request.inl:267 [inlined]
ucp_tag_send_req at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/tag_send.c:116 [inlined]
ucp_tag_send_nbx at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/tag_send.c:296
mca_pml_ucx_send at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/openmpi/mca_pml_ucx.so (unknown line)
ompi_coll_base_sendrecv_actual at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
ompi_coll_base_alltoall_intra_pairwise at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
ompi_coll_tuned_alltoall_intra_dec_fixed at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/openmpi/mca_coll_tuned.so (unknown line)
MPI_Alltoall at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
Alltoall! at /home/chiel/.julia/packages/MPI/08SPr/src/collective.jl:480
unknown function (ip: 0x14eefc4fc48f)
ucp_request_try_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/core/ucp_request.inl:242 [inlined]
ucp_request_send at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/core/ucp_request.inl:267 [inlined]
ucp_tag_send_req at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/tag_send.c:116 [inlined]
ucp_tag_send_nbx at /tmp/jenkins/build/UCX/1.10.0/GCCcore-10.3.0/ucx-1.10.0/src/ucp/tag/tag_send.c:296
mca_pml_ucx_send at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/openmpi/mca_pml_ucx.so (unknown line)
ompi_coll_base_sendrecv_actual at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
ompi_coll_base_alltoall_intra_pairwise at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
ompi_coll_tuned_alltoall_intra_dec_fixed at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/openmpi/mca_coll_tuned.so (unknown line)
MPI_Alltoall at /home/chiel/.local/easybuild/Centos8/2021/software/OpenMPI/4.1.1-GCC-10.3.0-CUDA-11.3.1/lib/libmpi.so (unknown line)
The code that triggers this error is:
using CUDA
using MPI
np = 2
MPI.Init()
comm = MPI.COMM_WORLD
mpiid = MPI.Comm_rank(comm)
print("The MPI rank is: $mpiid\n")
device!(mpiid)
cuda_bb = ENV["JULIA_CUDA_USE_BINARYBUILDER"]
print("The CUDA device is: $(device()), JULIA_CUDA_USE_BINARYBUILDER is $cuda_bb\n")
n = 1024
data_cpu = rand(n)
data_out_cpu = similar(data_cpu)
data = CuArray(data_cpu)
data_out = similar(data)
# Test the alltoall on the CPU
mpi_data_cpu = MPI.UBuffer(data_cpu, 512)
mpi_data_out_cpu = MPI.UBuffer(data_out_cpu, 512)
@time MPI.Alltoall!(mpi_data_cpu, mpi_data_out_cpu, comm)
@time MPI.Alltoall!(mpi_data_cpu, mpi_data_out_cpu, comm)
# Test the alltoall on the GPU
print("$mpiid has CUDA: $(MPI.has_cuda())\n")
mpi_data = MPI.UBuffer(data, 512)
mpi_data_out = MPI.UBuffer(data_out, 512)
@time MPI.Alltoall!(mpi_data, mpi_data_out, comm)
@time MPI.Alltoall!(mpi_data, mpi_data_out, comm)
# Close the MPI.
MPI.Finalize()
The equivalent working C++ code is:
#include <iostream>
#include <vector>
#include <mpi.h>
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <chrono>
int main()
{
    MPI_Init(NULL, NULL);

    int n, id;
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);

    const size_t size_tot = 1024*1024*1024;
    const size_t size_max = size_tot / n;

    // CPU TEST
    std::vector<double> a_cpu_in (size_tot);
    std::vector<double> a_cpu_out(size_tot);
    std::fill(a_cpu_in.begin(), a_cpu_in.end(), id);

    std::cout << id << ": Starting CPU all-to-all\n";
    auto time_start = std::chrono::high_resolution_clock::now();

    MPI_Alltoall(
        a_cpu_in .data(), size_max, MPI_DOUBLE,
        a_cpu_out.data(), size_max, MPI_DOUBLE,
        MPI_COMM_WORLD);

    auto time_end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration<double, std::milli>(time_end-time_start).count();
    std::cout << id << ": Finished CPU all-to-all in " << std::to_string(duration) << " (ms)\n";

    // GPU TEST
    int id_local = id % 4;
    cudaSetDevice(id_local);

    double* a_gpu_in;
    double* a_gpu_out;
    cudaMalloc((void **)&a_gpu_in , size_tot * sizeof(double));
    cudaMalloc((void **)&a_gpu_out, size_tot * sizeof(double));
    cudaMemcpy(a_gpu_in, a_cpu_in.data(), size_tot*sizeof(double), cudaMemcpyHostToDevice);

    int id_gpu;
    cudaGetDevice(&id_gpu);

    std::cout << id << ", " << id_local << ", " << id_gpu << ": Starting GPU all-to-all\n";
    time_start = std::chrono::high_resolution_clock::now();

    MPI_Alltoall(
        a_gpu_in , size_max, MPI_DOUBLE,
        a_gpu_out, size_max, MPI_DOUBLE,
        MPI_COMM_WORLD);

    time_end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration<double, std::milli>(time_end-time_start).count();
    std::cout << id << ", " << id_local << ", " << id_gpu << ": Finished GPU all-to-all in " << std::to_string(duration) << " (ms)\n";

    MPI_Finalize();

    return 0;
}
In the discussion on Discourse somebody suggested using export JULIA_CUDA_MEMORY_POOL=none, and this indeed solves the problem. I do not know whether this counts as a bug, though; it would be great if the memory pool and CUDA-aware MPI could be combined.
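For completeness, this is roughly how I apply the workaround to the reproducer above. It is only a sketch based on my understanding that the variable has to be set before CUDA.jl is loaded (or exported in the job script before launching Julia):

# Disable the CUDA.jl memory pool so UCX can pin the device buffers directly.
# Assumption: JULIA_CUDA_MEMORY_POOL is read when CUDA.jl initializes, so it
# must be set before `using CUDA` (or exported before starting Julia).
ENV["JULIA_CUDA_MEMORY_POOL"] = "none"

using CUDA
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
device!(MPI.Comm_rank(comm))

data = CUDA.rand(Float64, 1024)
data_out = similar(data)

# With the pool disabled, the CUDA-aware all-to-all no longer segfaults (2 ranks).
MPI.Alltoall!(MPI.UBuffer(data, 512), MPI.UBuffer(data_out, 512), comm)

MPI.Finalize()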
Have you tried https://juliaparallel.github.io/MPI.jl/stable/knownissues/#Memory-cache?
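As far as I remember, that section boils down to disabling UCX's memory-type cache before UCX is initialized, roughly like this (treat the exact variable name as my assumption; it can equally be exported in the job script):

# Sketch: disable UCX's memory-type cache before UCX starts up inside MPI.Init().
# Assumption: UCX_MEMTYPE_CACHE is the variable the known-issues page refers to.
ENV["UCX_MEMTYPE_CACHE"] = "no"

using CUDA
using MPI

MPI.Init()
# ... rest of the reproducer from above, unchanged ...
MPI.Finalize()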
Yes, I tried that as well. Only export JULIA_CUDA_MEMORY_POOL=none solves my problem.
Ah ok, the upstream issue is https://github.com/openucx/ucx/issues/7110