
QUDA doesn't link against UCX when NVSHMEM with UCX is used, resulting in undefined references to ucp_*

Open robert-mijakovic opened this issue 3 years ago • 2 comments

I'm compiling QUDA 1.1.0 using GCC 10.3.0, CUDA 11.3.1, OpenMPI, (external) Eigen 3.3.9, and (external) NVSHMEM 2.4.1 on CentOS 8.4. The build is configured with:

 cmake <other-options> -DQUDA_GPU_ARCH=sm_80 -DQUDA_MPI=ON -DQUDA_NVSHMEM=ON -DQUDA_NVSHMEM_HOME=$EBROOTNVSHMEM -DQUDA_DOWNLOAD_EIGEN=OFF -DPROPAGATED_FLAGS=" " -DMPIEXEC_EXECUTABLE="$(which srun)"

The build fails at link time with undefined references to ucp_* symbols. NVSHMEM was compiled with the UCX transport layer.

$ /apps/GCCcore/10.3.0/bin/g++ -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib64 -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib -L/apps/CUDA/11.3.1/lib64 -L/apps/CUDA/11.3.1/lib -L/apps/Python/3.9.5-GCCcore-10.3.0/lib64 -L/apps/Python/3.9.5-GCCcore-10.3.0/lib -L/apps/FFTW/3.3.9-gompi-2021a/lib64 -L/apps/FFTW/3.3.9-gompi-2021a/lib -L/apps/ScaLAPACK/2.1.0-gompi-2021a-fb/lib64 -L/apps/ScaLAPACK/2.1.0-gompi-2021a-fb/lib -L/apps/FlexiBLAS/3.0.4-GCC-10.3.0/lib64 -L/apps/FlexiBLAS/3.0.4-GCC-10.3.0/lib -L/apps/GCCcore/10.3.0/lib64 -L/apps/GCCcore/10.3.0/lib -Wl,-rpath -Wl,/apps/hwloc/2.4.1-GCCcore-10.3.0/lib -Wl,-rpath -Wl,/usr/lib64 -Wl,-rpath -Wl,/apps/OpenMPI/4.1.1-GCC-10.3.0/lib -Wl,--enable-new-dtags -L/mnt/tier2/apps/hwloc/2.4.1-GCCcore-10.3.0/lib -L/usr/lib64 -L/mnt/tier2/apps/OpenMPI/4.1.1-GCC-10.3.0/lib CMakeFiles/hisq_stencil_test.dir/hisq_stencil_test.cpp.o -o hisq_stencil_test  -Wl,-rpath,/dev/shm/QUDA/1.1.0/foss-2021a-CUDA-11.3.1/easybuild_obj/lib::::::::::::::::::::::: libquda_test.a ../lib/libquda.so /apps/CUDA/11.3.1/lib/libcudart_static.a -lpthread -ldl /usr/lib64/librt.so /usr/lib64/libcuda.so /apps/CUDA/11.3.1/lib/libcublas.so /apps/OpenMPI/4.1.1-GCC-10.3.0/lib/libmpi.so -L/apps/NVSHMEM/2.4.1-gompi-2021a-CUDA-11.3.1/lib -lnvshmem  -L"/apps/CUDA/11.3.1/targets/x86_64-linux/lib/stubs" -L"/apps/CUDA/11.3.1/targets/x86_64-linux/lib"
../lib/libquda.so: error: undefined reference to 'ucp_rkey_destroy'
../lib/libquda.so: error: undefined reference to 'ucp_worker_flush_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_rkey_buffer_release'
../lib/libquda.so: error: undefined reference to 'ucp_rkey_pack'
../lib/libquda.so: error: undefined reference to 'ucp_worker_set_am_recv_handler'
../lib/libquda.so: error: undefined reference to 'ucp_worker_query'
../lib/libquda.so: error: undefined reference to 'ucp_worker_create'
../lib/libquda.so: error: undefined reference to 'ucp_mem_map'
../lib/libquda.so: error: undefined reference to 'ucp_init_version'
../lib/libquda.so: error: undefined reference to 'ucp_config_modify'
../lib/libquda.so: error: undefined reference to 'ucp_config_read'
../lib/libquda.so: error: undefined reference to 'ucp_am_send_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_put_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_am_data_release'
../lib/libquda.so: error: undefined reference to 'ucp_config_release'
../lib/libquda.so: error: undefined reference to 'ucp_cleanup'
../lib/libquda.so: error: undefined reference to 'ucp_worker_destroy'
../lib/libquda.so: error: undefined reference to 'ucp_request_check_status'
../lib/libquda.so: error: undefined reference to 'ucp_get_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_worker_fence'
../lib/libquda.so: error: undefined reference to 'ucp_atomic_op_nbx'
../lib/libquda.so: error: undefined reference to 'ucp_worker_progress'
../lib/libquda.so: error: undefined reference to 'ucp_ep_rkey_unpack'
../lib/libquda.so: error: undefined reference to 'ucp_atomic_post'
../lib/libquda.so: error: undefined reference to 'ucp_request_free'
../lib/libquda.so: error: undefined reference to 'ucp_ep_close_nb'
../lib/libquda.so: error: undefined reference to 'ucp_ep_create'
../lib/libquda.so: error: undefined reference to 'ucp_worker_release_address'
../lib/libquda.so: error: undefined reference to 'ucp_worker_get_address'
../lib/libquda.so: error: undefined reference to 'ucp_mem_unmap'
collect2: error: ld returned 1 exit status

NVSHMEM is built with:

$ make -j 1 NVSHMEM_MPI_SUPPORT=1 NVSHMEM_UCX_SUPPORT=1 UCX_HOME=$EBROOTUCX NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80" NVSHMEM_USE_NCCL=1 NVSHMEM_PMIX_SUPPORT=1

The issue is that QUDA doesn't link against UCX: the link line above is missing -L$(UCX_HOME)/lib -lucs -lucp.

Looking at NVSHMEM's common.mk, it appears NVIDIA intends codes that use NVSHMEM to link against UCX themselves, i.e., QUDA is expected to link against it:

ifeq ($(NVSHMEM_UCX_SUPPORT), 1)
TESTLDFLAGS += -L$(UCX_HOME)/lib -lucs -lucp
endif

I would add the flags myself, but QUDA's CMakeLists.txt doesn't provide such an option.

robert-mijakovic, Dec 17 '21 13:12

Yes, NVSHMEM built with UCX is not supported in QUDA yet. QUDA uses CMake and NVSHMEM doesn't, so propagation of usage requirements is limited. You should be able to specify additional linker flags using CMAKE_EXE_LINKER_FLAGS.
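
For readers hitting the same error, here is a minimal sketch of how that workaround might look. It assumes the UCX install prefix is available in EBROOTUCX (the variable passed as UCX_HOME in the NVSHMEM build above) and that the libraries live under lib/ as in NVSHMEM's common.mk; /path/to/ucx is only a placeholder. The snippet prints the extra option rather than running cmake, so it can be appended to the original configure line.

```shell
# Assumption: EBROOTUCX points at the UCX install prefix (as in the NVSHMEM build).
# /path/to/ucx is a placeholder used only when EBROOTUCX is unset.
UCX_HOME="${EBROOTUCX:-/path/to/ucx}"

# The same libraries NVSHMEM's common.mk adds to TESTLDFLAGS when UCX support is on.
UCX_LDFLAGS="-L${UCX_HOME}/lib -lucs -lucp"

# Print the option to append to the existing cmake command line.
echo "-DCMAKE_EXE_LINKER_FLAGS=\"${UCX_LDFLAGS}\""
```

CMAKE_EXE_LINKER_FLAGS only affects executables such as the test binaries; if a shared library fails to link in some other setup, CMAKE_SHARED_LINKER_FLAGS is the corresponding CMake variable.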

mathiaswagner, Dec 17 '21 13:12

Thank you for the workaround. I have tested it and it worked well.

robert-mijakovic, Dec 17 '21 18:12