symbol lookup error: /usr/local/lib/ucx/libuct_cuda.so.0: undefined symbol: nvmlDeviceGetGpuFabricInfo

Open CSEEduanyu opened this issue 7 months ago • 5 comments

NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 12.4
Can UCX run on the A800? My driver does not seem to provide the nvmlDeviceGetGpuFabricInfo function.

CSEEduanyu • May 25 '25 12:05

Driver version 470.161.03 corresponds to CUDA toolkit 11.4. Is it possible to upgrade the driver on the target system to at least 525.x.x (which corresponds to CUDA 12.0)? Otherwise, please use a UCX build made against CUDA 11.
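
To confirm that the installed driver is what lacks the function, one option is to ask the NVML library directly whether it exports the symbol. The following is a minimal sketch, not part of UCX: the helper name `has_symbol` is hypothetical, and `libnvidia-ml.so.1` is assumed to be the NVML library name on the target Linux system.

```python
import ctypes


def has_symbol(lib_name, symbol):
    """Return True/False if the library loads, or None if it cannot be loaded."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None  # library not found or not loadable on this machine
    # ctypes looks the symbol up on attribute access; a missing symbol
    # raises AttributeError, which hasattr() turns into False.
    return hasattr(lib, symbol)


# Assumption: NVML is installed under its usual Linux soname.
result = has_symbol("libnvidia-ml.so.1", "nvmlDeviceGetGpuFabricInfo")
messages = {True: "symbol present", False: "symbol missing", None: "NVML not loadable"}
print(messages[result])
```

If this reports "symbol missing", the driver's NVML predates the function that this libuct_cuda.so.0 build expects, which matches the lookup error in the title.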

rakhmets • May 26 '25 11:05

If you build UCX from source, you can also try this patch: https://github.com/openucx/ucx/pull/10680. It should work without changing the driver.

rakhmets • May 26 '25 15:05

> If you build UCX from sources, you can also try to use this patch #10680. It should work without changing the driver.

I have already tried this version, but it still reports the error: undefined symbol: nvmlDeviceGetGpuFabricInfo.

njw1123 • Jun 08 '25 11:06

Is there any other solution besides reinstalling the driver?

njw1123 • Jun 08 '25 11:06

> > If you build UCX from sources, you can also try to use this patch #10680. It should work without changing the driver.
>
> I have already tried this version, but it still reports the error: undefined symbol: nvmlDeviceGetGpuFabricInfo.

I'm not sure how this is possible with this patch, since it should disable the use of NVML in this case. Could you please attach the logs from a run with UCX_LOG_LEVEL=debug?
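
One way to capture such a log is to set the variable only for the failing command and redirect its output to a file. This is a minimal sketch: the helper `run_with_ucx_debug`, the log file name, and the `ucx_info -d` command in the comment are illustrative; substitute the command that actually triggers the error.

```python
import os
import subprocess


def run_with_ucx_debug(cmd, log_path):
    """Run cmd with UCX_LOG_LEVEL=debug and write stdout and stderr to log_path."""
    # Copy the current environment and add the UCX log level for this run only.
    env = dict(os.environ, UCX_LOG_LEVEL="debug")
    with open(log_path, "w") as log:
        proc = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return proc.returncode


# Example call (illustrative command and file name):
# run_with_ucx_debug(["ucx_info", "-d"], "ucx_debug.log")
```

The resulting file can then be attached to the issue.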

rakhmets • Jun 10 '25 13:06