
libucc_tl_cuda.so: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType

Open zasdfgbnm opened this issue 3 years ago • 7 comments

I am seeing this error:

libucc_tl_cuda.so: undefined symbol: nvmlDeviceGetNvLinkRemoteDeviceType

Thanks to @crcrpar, who figured out that this is a new API: https://github.com/NVIDIA/nvidia-settings/blame/5b455b89bb73f56818c84444806bc9c928da67ac/src/nvml.h#L6009-L6026

For older versions of drivers, is it possible to use other APIs to achieve similar functionality? Or at least detect the version and throw a kinder error message?

cc: @ptrblck

zasdfgbnm avatar May 03 '22 17:05 zasdfgbnm

@bureddy Can you take a look, please.

jladd-mlnx avatar May 03 '22 18:05 jladd-mlnx

Hi @zasdfgbnm, the existing autotools code does check for the presence of that function at compile time, here: https://github.com/openucx/ucc/blob/e96a6de3def951748a8c1bd9f3d074f73c594f1f/config/m4/cuda.m4#L79. So I guess it was available at compile time and, in your case, is not available at runtime. This implies a mismatch between the compile-time and runtime CUDA versions. Could you please check the environment and confirm?

vspetrov avatar May 04 '22 12:05 vspetrov

We're seeing the undefined symbol message when we run a container that has CUDA 11.6 on a host with an older driver.

crcrpar avatar May 04 '22 15:05 crcrpar

What is the driver version? Is it possible to choose a matching CUDA toolkit version in the container? See https://docs.nvidia.com/deploy/cuda-compatibility/index.html. Otherwise, I think you need to have cuda-compat-11.6 in the container for compatibility.

bureddy avatar May 04 '22 16:05 bureddy

The KMD was 460.73.01, UMD 510.47.03, and forward compat was used.

ptrblck avatar May 04 '22 18:05 ptrblck

Unfortunately, it seems there is no forward compatibility for NVML (libnvidia-ml.so).

bureddy avatar May 04 '22 23:05 bureddy

@bureddy, what do you think about @zasdfgbnm's second question: "Or at least detect the version and throw a kinder error message?"

crcrpar avatar May 05 '22 00:05 crcrpar
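
For illustration only, here is a minimal sketch of how a "kinder error message" could be produced by probing for the symbol at runtime with `dlopen`/`dlsym`. It is not UCC's actual code: the helper names `probe_symbol` and `check_nvml_nvlink_support` are hypothetical, and the library name `libnvidia-ml.so.1` is assumed to be the NVML soname on the target system. It would also only help if the component resolved the function lazily instead of linking against it directly, since a directly linked undefined symbol fails in the dynamic loader before any check can run.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if `symbol` resolves in `library`,
 * 0 if the library loads but does not export the symbol, and
 * -1 if the library itself cannot be loaded.
 * Passing NULL as `library` probes the running program and its dependencies. */
int probe_symbol(const char *library, const char *symbol)
{
    void *handle = dlopen(library, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL) {
        return -1;
    }

    dlerror();                    /* clear any stale error state */
    (void)dlsym(handle, symbol);  /* only the error state matters here */
    int found = (dlerror() == NULL);

    dlclose(handle);
    return found;
}

/* Hypothetical check: prints a readable diagnostic instead of letting the
 * loader fail with a bare "undefined symbol" message.
 * Assumes the NVML soname is libnvidia-ml.so.1. */
int check_nvml_nvlink_support(void)
{
    const char *lib = "libnvidia-ml.so.1";
    const char *sym = "nvmlDeviceGetNvLinkRemoteDeviceType";

    int rc = probe_symbol(lib, sym);
    if (rc == -1) {
        fprintf(stderr, "NVML library %s could not be loaded\n", lib);
    } else if (rc == 0) {
        fprintf(stderr,
                "%s does not provide %s; the installed driver is probably "
                "older than the CUDA toolkit this build was compiled against\n",
                lib, sym);
    }
    return rc == 1;
}
```

A caller could invoke `check_nvml_nvlink_support()` once during initialization and disable the NVLink-specific path when it returns 0. On older glibc versions the program may need to be linked with `-ldl`.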