QUDA doesn't link against UCX when NVSHMEM with UCX is used, resulting in undefined references to ucp_*
I'm compiling QUDA 1.1.0 using GCC 10.3.0, CUDA 11.3.1, OpenMPI, (external) Eigen 3.3.9, and (external) NVSHMEM 2.4.1 on CentOS 8.4. The build is configured with:
cmake <other-options> -DQUDA_GPU_ARCH=sm_80 -DQUDA_MPI=ON -DQUDA_NVSHMEM=ON -DQUDA_NVSHMEM_HOME=$EBROOTNVSHMEM -DQUDA_DOWNLOAD_EIGEN=OFF -DPROPAGATED_FLAGS=" " -DMPIEXEC_EXECUTABLE="$(which srun)"
The build fails in the linking phase with undefined references to ucp_* symbols. NVSHMEM is compiled with the UCX transport layer.
$ /apps/GCCcore/10.3.0/bin/g++ -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib64 -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib -L/apps/CUDA/11.3.1/lib64 -L/apps/CUDA/11.3.1/lib -L/apps/Python/3.9.5-GCCcore-10.3.0/lib64 -L/apps/Python/3.9.5-GCCcore-10.3.0/lib -L/apps/FFTW/3.3.9-gompi-2021a/lib64 -L/apps/FFTW/3.3.9-gompi-2021a/lib -L/apps/ScaLAPACK/2.1.0-gompi-2021a-fb/lib64 -L/apps/ScaLAPACK/2.1.0-gompi-2021a-fb/lib -L/apps/FlexiBLAS/3.0.4-GCC-10.3.0/lib64 -L/apps/FlexiBLAS/3.0.4-GCC-10.3.0/lib -L/apps/GCCcore/10.3.0/lib64 -L/apps/GCCcore/10.3.0/lib -Wl,-rpath -Wl,/apps/hwloc/2.4.1-GCCcore-10.3.0/lib -Wl,-rpath -Wl,/usr/lib64 -Wl,-rpath -Wl,/apps/OpenMPI/4.1.1-GCC-10.3.0/lib -Wl,--enable-new-dtags -L/mnt/tier2/apps/hwloc/2.4.1-GCCcore-10.3.0/lib -L/usr/lib64 -L/mnt/tier2/apps/OpenMPI/4.1.1-GCC-10.3.0/lib CMakeFiles/hisq_stencil_test.dir/hisq_stencil_test.cpp.o -o hisq_stencil_test -Wl,-rpath,/dev/shm/QUDA/1.1.0/foss-2021a-CUDA-11.3.1/easybuild_obj/lib::::::::::::::::::::::: libquda_test.a ../lib/libquda.so /apps/CUDA/11.3.1/lib/libcudart_static.a -lpthread -ldl /usr/lib64/librt.so /usr/lib64/libcuda.so /apps/CUDA/11.3.1/lib/libcublas.so /apps/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib -lnvshmem -L"/apps/CUDA/11.3.1/targets/x86_64-linux/lib/stubs" -L"/apps/CUDA/11.3.1/targets/x86_64-linux/lib"
../lib/libquda.so: error: undefined reference to 'ucp_rkey_destroy'
../lib/libquda.so: error: undefined reference to 'ucp_worker_flush_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_rkey_buffer_release'
../lib/libquda.so: error: undefined reference to 'ucp_rkey_pack'
../lib/libquda.so: error: undefined reference to 'ucp_worker_set_am_recv_handler'
../lib/libquda.so: error: undefined reference to 'ucp_worker_query'
../lib/libquda.so: error: undefined reference to 'ucp_worker_create'
../lib/libquda.so: error: undefined reference to 'ucp_mem_map'
../lib/libquda.so: error: undefined reference to 'ucp_init_version'
../lib/libquda.so: error: undefined reference to 'ucp_config_modify'
../lib/libquda.so: error: undefined reference to 'ucp_config_read'
../lib/libquda.so: error: undefined reference to 'ucp_am_send_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_put_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_am_data_release'
../lib/libquda.so: error: undefined reference to 'ucp_config_release'
../lib/libquda.so: error: undefined reference to 'ucp_cleanup'
../lib/libquda.so: error: undefined reference to 'ucp_worker_destroy'
../lib/libquda.so: error: undefined reference to 'ucp_request_check_status'
../lib/libquda.so: error: undefined reference to 'ucp_get_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_worker_fence'
../lib/libquda.so: error: undefined reference to 'ucp_atomic_op_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_worker_progress'
../lib/libquda.so: error: undefined reference to 'ucp_ep_rkey_unpack'
../lib/libquda.so: error: undefined reference to 'ucp_atomic_post'
../lib/libquda.so: error: undefined reference to 'ucp_request_free'
../lib/libquda.so: error: undefined reference to 'ucp_ep_close_nb'
../lib/libquda.so: error: undefined reference to 'ucp_ep_create'
../lib/libquda.so: error: undefined reference to 'ucp_worker_release_address'
../lib/libquda.so: error: undefined reference to 'ucp_worker_get_address'
../lib/libquda.so: error: undefined reference to 'ucp_mem_unmap'
collect2: error: ld returned 1 exit status
NVSHMEM is built with:
$ make -j 1 NVSHMEM_MPI_SUPPORT=1 NVSHMEM_UCX_SUPPORT=1 UCX_HOME=$EBROOTUCX NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80" NVSHMEM_USE_NCCL=1 NVSHMEM_PMIX_SUPPORT=1
The issue is that QUDA doesn't link against UCX, i.e., the link line is missing -L$(UCX_HOME)/lib -lucs -lucp.
Looking at NVSHMEM's common.mk, it appears that NVIDIA intends codes that use NVSHMEM to link against UCX themselves, i.e., they expect QUDA to link against it:
ifeq ($(NVSHMEM_UCX_SUPPORT), 1)
TESTLDFLAGS += -L$(UCX_HOME)/lib -lucs -lucp
endif
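One way to confirm that the NVSHMEM library itself leaves the ucp_* symbols unresolved is to inspect it with `nm`. The following is a minimal sketch: the helper name `undefined_ucp_symbols` is hypothetical, and the library path in the usage comment is an assumption about this particular install.

```shell
# List the UCX (ucp_*) symbols that a library leaves undefined.
# Reads `nm` output on stdin; undefined symbols are the lines of type "U".
undefined_ucp_symbols() {
    awk '$1 == "U" && $2 ~ /^ucp_/ { print $2 }' | sort -u
}

# Usage (the path is illustrative, adjust it to your NVSHMEM install):
#   nm "$EBROOTNVSHMEM/lib/libnvshmem.a" | undefined_ucp_symbols
```

If this prints the same names as the linker errors above, the missing references come from NVSHMEM's UCX transport rather than from QUDA's own sources.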
I would add the flags myself, but QUDA's CMakeLists.txt doesn't provide such an option.
Yes, QUDA does not yet support NVSHMEM built with the UCX transport. QUDA uses CMake and NVSHMEM doesn't, so NVSHMEM's usage requirements (such as its UCX link dependencies) can't be propagated automatically.
You should be able to specify additional linker flags using CMAKE_EXE_LINKER_FLAGS.
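As a rough sketch of that workaround, the UCX flags from NVSHMEM's common.mk can be assembled and handed to CMake at configure time. The helper name `ucx_link_flags` and the fallback path `/path/to/ucx` are hypothetical; `EBROOTUCX` is the UCX prefix used in the NVSHMEM build above and may be named differently on other systems.

```shell
# Build the UCX linker flags for a given UCX install prefix.
# These mirror the TESTLDFLAGS line from NVSHMEM's common.mk quoted above.
ucx_link_flags() {
    printf '%s' "-L$1/lib -lucs -lucp"
}

# EBROOTUCX is the prefix used in the NVSHMEM build above; adjust for your site.
UCX_LDFLAGS="$(ucx_link_flags "${EBROOTUCX:-/path/to/ucx}")"
echo "$UCX_LDFLAGS"

# Then pass the flags at configure time, for example:
#   cmake <other-options> -DQUDA_MPI=ON -DQUDA_NVSHMEM=ON \
#         -DQUDA_NVSHMEM_HOME="$EBROOTNVSHMEM" \
#         -DCMAKE_EXE_LINKER_FLAGS="$UCX_LDFLAGS"
```

Because the failing step links an executable (hisq_stencil_test), the flags in CMAKE_EXE_LINKER_FLAGS end up on that link line, which is where the ucp_* references need to be resolved.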
Thank you for the workaround. I have tested it and it worked well.