dlopen of libucm_cuda fails due to symbols from libucs not being found

Open lidavidm opened this issue 3 years ago • 0 comments

Describe the bug

When initializing the memtype cache, UCX fails to dlopen libucm_cuda because it depends on symbols from libucs but does not actually link to it. Normally this is fine, since libucs is already in the global symbol scope. However, if UCX itself was dlopened, as happens when Python loads an extension module, libucs won't be in ld.so's global scope. This generates a log message like:

[1649166537.059209] [partita:13555:0]          module.c:256  UCX  DEBUG dlopen('/home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0', mode=0x1001) failed: /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0: undefined symbol: ucs_status_string

In our environment, this leads to a crash when trying to send a CUDA buffer. Because the memtype cache is created but empty, the eager short protocol is used with the TCP transport. This tries to memcpy the CUDA buffer on the host (uct_tcp_ep_am_short -> uct_am_short_fill_data), which causes a segfault. However, I haven't been able to minimize this into a standalone example (in our case, it depends on ucx-py and a proprietary application).

A few workarounds do work:

  • Using patchelf to make libucm_cuda.so depend on libucs.so works.
  • LD_PRELOADing libucs.so also works.
  • Forcing Python to load the extension module with RTLD_GLOBAL also works.

Ideally, we would just make libucm_cuda.so depend on libucs.so to avoid this, though.
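
For reference, the RTLD_GLOBAL workaround can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not what ucx-py itself does: the helper names `rtld_global_imports` and `import_with_rtld_global` are hypothetical, and it assumes a Unix platform where `sys.setdlopenflags` and `os.RTLD_GLOBAL` are available. The flags only affect extension modules that have not been imported yet, so this has to run before the first import of the UCX bindings.

```python
import importlib
import os
import sys
from contextlib import contextmanager


@contextmanager
def rtld_global_imports():
    """Temporarily add RTLD_GLOBAL to the flags Python passes to dlopen().

    Hypothetical helper: extension modules imported inside the `with`
    block are loaded with RTLD_GLOBAL, so their symbols (and those of
    their dependencies) become visible to libraries dlopened later.
    """
    old_flags = sys.getdlopenflags()
    sys.setdlopenflags(old_flags | os.RTLD_GLOBAL)
    try:
        yield
    finally:
        # Restore the previous flags so unrelated imports are unaffected.
        sys.setdlopenflags(old_flags)


def import_with_rtld_global(module_name):
    """Import `module_name` while RTLD_GLOBAL is in effect."""
    with rtld_global_imports():
        return importlib.import_module(module_name)


# Assumed usage, with "ucp" as the import name of the ucx-py bindings:
#   ucp = import_with_rtld_global("ucp")
```

The context manager restores the original flags afterwards, which limits the symbol-clash risk of RTLD_GLOBAL to the modules that actually need it.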

Steps to Reproduce

As mentioned, I can't get a standalone reproducer to crash, so this one instead demonstrates the dlopen failure message described above. The example application can be found at: https://github.com/lidavidm/ucx-bug-report

It is simply ucp_client_server.c repackaged so that it is dlopened instead of being linked directly.

  • UCX version used: 1.12.0 (from rapidsai conda)
    # UCT version=1.12.0 revision d367332
    # configured with: --build=x86_64-conda-linux-gnu --host=x86_64-conda-linux-gnu --prefix=/home/lidavidm/miniconda3/envs/ucx --with-sysroot --enable-cma --enable-mt --enable-numa --with-gnu-ld --with-cuda=/usr/local/cuda
    
[09:57:17] lidavidm@partita /home/lidavidm/Code/upstream/ucx/report3/build  
> [1649167037.106580] [partita:13863:0]     ucp_context.c:1779 UCX  INFO  UCP version is 1.12 (release 0)           (ucx) 
[1649167037.157884] [partita:13863:0]      ucp_worker.c:1867 UCX  INFO    ep_cfg[0]: rma(cuda_copy/cuda); 
[1649167037.157936] [partita:13863:0]          parser.c:1916 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=info
[1649167037.160414] [partita:13863:0]      ucp_worker.c:1867 UCX  INFO    ep_cfg[0]: rma(cuda_copy/cuda); 
server is listening on IP 0.0.0.0 port 13337
Waiting for connection...

[09:57:18] lidavidm@partita /home/lidavidm/Code/upstream/ucx/report3/build  
> env UCX_LOG_LEVEL=trace ./main -a 127.0.0.1 2>&1 | rg 'dlopen|loaded'                                             (ucx) 
[1649167042.749723] [partita:13895:0]            init.c:116  UCX  DEBUG /home/lidavidm/miniconda3/envs/ucx/lib/libucs.so.0 loaded at 0x7f57ce0f2000
[1649167042.750529] [partita:13895:0]          module.c:180  UCX  TRACE loaded /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libuct_cuda.so.0.0.0 [0x558e41270880]
[1649167042.750588] [partita:13895:0]          module.c:180  UCX  TRACE loaded /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libuct_cma.so.0.0.0 [0x558e41272090]
[1649167042.764901] [partita:13895:0]          module.c:256  UCX  DEBUG dlopen('/home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0', mode=0x1001) failed: /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0: undefined symbol: ucs_status_string
[1649167042.764935] [partita:13895:0]          module.c:256  UCX  DEBUG dlopen('/home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0', mode=0x1001) failed: /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0: undefined symbol: ucs_status_string
Server received a connection request from client at address 127.0.0.1:54846
[1649167042.765132] [partita:13863:0]      ucp_worker.c:1867 UCX  INFO    ep_cfg[1]: stream(tcp/lo); 
Server: iteration #1
UCX data message was received


----- UCP TEST SUCCESS -------

ABCDEFGHIJKLMNO.


------------------------------

Waiting for connection...

Setup and versions

$ uname -a
Linux partita 5.4.0-91-generic #102~18.04.1-Ubuntu SMP Thu Nov 11 14:46:36 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Ubuntu 18.04.6 LTS \n \l
$ nvidia-smi
Tue Apr  5 09:59:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro T2000 wi...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   59C    P8     1W /  N/A |    865MiB /  4096MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Additional Information

From LD_DEBUG=all we can see

     14572:     symbol=ucs_status_string;  lookup in file=./main [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib/x86_64-linux-gnu/libgcc_s.so.1 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/home/lidavidm/miniconda3/envs/ucx/lib/./libucm.so.0 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib/x86_64-linux-gnu/libm.so.6 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/lib/x86_64-linux-gnu/librt.so.1 [0]
     14572:     /home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0: error: symbol lookup error: undefined symbol: ucs_status_string (fatal)

The dynamic linker searches the dependencies of the main application, then the dependencies of libucm_cuda.so, but neither set contains libucs.so.
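
To check this kind of trace without reading it by eye, something like the following Python sketch could be used. It assumes the `symbol=...; lookup in file=...` line format shown above; the function names `searched_files` and `scope_contains` are hypothetical, and the sample is a three-line excerpt of the trace rather than the full output.

```python
import os
import re

# Matches LD_DEBUG lines of the form shown in the trace above.
LOOKUP_RE = re.compile(r"symbol=(?P<symbol>[^;]+);\s+lookup in file=(?P<file>\S+)")


def searched_files(ld_debug_text, symbol):
    """Return, in order, the files ld.so searched for `symbol`."""
    files = []
    for line in ld_debug_text.splitlines():
        match = LOOKUP_RE.search(line)
        if match and match.group("symbol") == symbol:
            files.append(match.group("file"))
    return files


def scope_contains(files, library_stem):
    """True if any searched file's basename starts with `library_stem`."""
    return any(os.path.basename(f).startswith(library_stem) for f in files)


# Excerpt of the trace from this report.
SAMPLE = """\
     14572:     symbol=ucs_status_string;  lookup in file=./main [0]
     14572:     symbol=ucs_status_string;  lookup in file=/home/lidavidm/miniconda3/envs/ucx/lib/ucx/libucm_cuda.so.0 [0]
     14572:     symbol=ucs_status_string;  lookup in file=/home/lidavidm/miniconda3/envs/ucx/lib/./libucm.so.0 [0]
"""

files = searched_files(SAMPLE, "ucs_status_string")
print(scope_contains(files, "libucs.so"))  # False for the excerpt above
```

Running the same check on a trace captured with one of the workarounds applied would be expected to show libucs.so in the searched files, though I have not included such a trace here.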

lidavidm, Apr 05 '22 14:04