dlopen of libucm_cuda fails due to symbols from libucs not being found
Describe the bug
When initializing the memtype cache, UCX fails to dlopen libucm_cuda because it depends on symbols from libucs but does not actually link to it. Normally this is fine. However, if UCX was itself dlopened, which happens in cases like Python loading an extension module, libucs won't be in ld.so's symbol lookup scope. This generates a log message like:
[1649166537.059209] [partita:13555:0] module.c:256 UCX DEBUG dlopen('/home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0', mode=0x1001) failed: /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0: undefined symbol: ucs_status_string
In our environment, this leads to a crash when trying to send a CUDA buffer. Because the memtype cache is created, but empty, the eager short protocol is used with the TCP transport. This tries to memcpy the buffer (uct_tcp_ep_am_short -> uct_am_short_fill_data), which of course causes a segfault. But I haven't been able to minimize this into a standalone example (in our case, it depends on ucx-py and a proprietary application).
A few workarounds do work:
- Using patchelf to make libucm_cuda.so depend on libucs.so works.
- LD_PRELOADing libucs.so also works.
- Forcing Python to load the extension module with RTLD_GLOBAL also works.
Ideally, we would just make libucm_cuda.so depend on libucs.so to avoid this, though.
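For reference, the RTLD_GLOBAL workaround can be sketched roughly as follows. This is a minimal sketch, not the exact code from our application: the helper name `global_dlopen_flags` is made up for illustration, and it assumes the extension module is imported for the first time inside the `with` block (the module name `ucp` in the comment is ucx-py's, substitute whatever extension actually pulls in UCX).

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def global_dlopen_flags():
    """Temporarily add RTLD_GLOBAL to the flags Python uses to dlopen extension modules.

    Hypothetical helper for illustration. Libraries loaded while the flag is set
    (and their dependencies, such as libucs.so) land in the global lookup scope,
    so a later dlopen of libucm_cuda.so should be able to resolve ucs_* symbols.
    """
    saved = sys.getdlopenflags()
    sys.setdlopenflags(saved | os.RTLD_GLOBAL)
    try:
        yield
    finally:
        # Restore the previous flags so unrelated extensions keep the default scope.
        sys.setdlopenflags(saved)


# Usage (assumption: the extension has not been imported earlier in the process):
#
#     with global_dlopen_flags():
#         import ucp
```

Restoring the flags afterwards keeps the change scoped to the one import, since loading every extension with RTLD_GLOBAL can cause symbol clashes between unrelated modules.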
Steps to Reproduce
As mentioned, I can't get a reproducer to crash. This reproducer will instead demonstrate the dlopen failed message described above. The example application can be found at: https://github.com/lidavidm/ucx-bug-report
It is simply ucp_client_server.c repackaged in such a way that it is dlopened instead of directly linked to.
- UCX version used: 1.12.0 (from rapidsai conda)
# UCT version=1.12.0 revision d367332
# configured with: --build=x86_64-conda-linux-gnu --host=x86_64-conda-linux-gnu --prefix=/home/lidavidm/miniconda3/envs/ucx --with-sysroot --enable-cma --enable-mt --enable-numa --with-gnu-ld --with-cuda=/usr/local/cuda
[09:57:17] lidavidm@partita /home/lidavidm/Code/upstream/ucx/report3/build
> [1649167037.106580] [partita:13863:0] ucp_context.c:1779 UCX INFO UCP version is 1.12 (release 0) (ucx)
[1649167037.157884] [partita:13863:0] ucp_worker.c:1867 UCX INFO ep_cfg[0]: rma(cuda_copy/cuda);
[1649167037.157936] [partita:13863:0] parser.c:1916 UCX INFO UCX_* env variable: UCX_LOG_LEVEL=info
[1649167037.160414] [partita:13863:0] ucp_worker.c:1867 UCX INFO ep_cfg[0]: rma(cuda_copy/cuda);
server is listening on IP 0.0.0.0 port 13337
Waiting for connection...
[09:57:18] lidavidm@partita /home/lidavidm/Code/upstream/ucx/report3/build
> env UCX_LOG_LEVEL=trace ./main -a 127.0.0.1 2>&1 | rg 'dlopen|loaded' (ucx)
[1649167042.749723] [partita:13895:0] init.c:116 UCX DEBUG /home/lidavidm/miniconda3/envs/ucx/lib/libucs.so.0 loaded at 0x7f57ce0f2000
[1649167042.750529] [partita:13895:0] module.c:180 UCX TRACE loaded /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libuct_cuda.so.0.0.0 [0x558e41270880]
[1649167042.750588] [partita:13895:0] module.c:180 UCX TRACE loaded /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libuct_cma.so.0.0.0 [0x558e41272090]
[1649167042.764901] [partita:13895:0] module.c:256 UCX DEBUG dlopen('/home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0', mode=0x1001) failed: /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0: undefined symbol: ucs_status_string
[1649167042.764935] [partita:13895:0] module.c:256 UCX DEBUG dlopen('/home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0', mode=0x1001) failed: /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0: undefined symbol: ucs_status_string
Server received a connection request from client at address 127.0.0.1:54846
[1649167042.765132] [partita:13863:0] ucp_worker.c:1867 UCX INFO ep_cfg[1]: stream(tcp/lo);
Server: iteration #1
UCX data message was received
----- UCP TEST SUCCESS -------
ABCDEFGHIJKLMNO.
------------------------------
Waiting for connection...
Setup and versions
$ uname -a
Linux partita 5.4.0-91-generic #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Ubuntu 18.04.6 LTS \n \l
$ nvidia-smi
Tue Apr 5 09:59:05 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro T2000 wi... On | 00000000:01:00.0 Off | N/A |
| N/A 59C P8 1W / N/A | 865MiB / 4096MiB | 12% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Additional Information
From LD_DEBUG=all we can see:
14572: symbol=ucs_status_string; lookup in file=./main [0]
14572: symbol=ucs_status_string; lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
14572: symbol=ucs_status_string; lookup in file=/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0]
14572: symbol=ucs_status_string; lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
14572: symbol=ucs_status_string; lookup in file=/lib/x86_64-linux-gnu/libgcc_s.so.1 [0]
14572: symbol=ucs_status_string; lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
14572: symbol=ucs_status_string; lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
14572: symbol=ucs_status_string; lookup in file=/home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0 [0]
14572: symbol=ucs_status_string; lookup in file=/home/lidavidm/miniconda3/envs/ucx/lib/./libucm.so.0 [0]
14572: symbol=ucs_status_string; lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
14572: symbol=ucs_status_string; lookup in file=/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0]
14572: symbol=ucs_status_string; lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
14572: symbol=ucs_status_string; lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
14572: symbol=ucs_status_string; lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
14572: symbol=ucs_status_string; lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
14572: symbol=ucs_status_string; lookup in file=/lib/x86_64-linux-gnu/librt.so.1 [0]
14572: /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0: error: symbol lookup error: undefined symbol: ucs_status_string (fatal)
The dynamic linker searches the main application and its dependencies, then libucm_cuda.so and its own dependencies, but neither set contains libucs.so.
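Given that lookup order, another way to get libucs.so into the global scope from Python, equivalent in spirit to LD_PRELOAD, might be to dlopen it explicitly with RTLD_GLOBAL before UCX initializes. The sketch below is an untested assumption on my part rather than something verified against this setup; `preload_global` is a hypothetical helper, and the libucs.so path in the comment has to be adapted to the actual install prefix.

```python
import ctypes


def preload_global(path):
    """dlopen `path` with RTLD_GLOBAL and return the handle, or None if loading fails.

    Hypothetical helper for illustration. Opening a library with RTLD_GLOBAL makes
    its symbols available for resolving references in libraries loaded afterwards.
    """
    try:
        return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        # Library not found or not loadable; let the caller decide what to do.
        return None


# Usage (assumption: run before the extension module that loads UCX is imported):
#
#     preload_global("/home/lidavidm/miniconda3/envs/ucx/lib/libucs.so.0")
```

Unlike LD_PRELOAD, this does not require changing how the process is launched, but it does need to run early enough in the Python process.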