symbol lookup error: /usr/local/lib/ucx/libuct_cuda.so.0: undefined symbol: nvmlDeviceGetGpuFabricInfo
NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 12.4
I'm wondering whether UCX can run on the A800. My driver does not seem to provide the nvmlDeviceGetGpuFabricInfo function.
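One way to confirm whether the installed driver's NVML library actually exports the symbol is to load it and look the name up. The following is only a minimal sketch: the helper name `has_symbol` is made up for illustration, and the library name `libnvidia-ml.so.1` is an assumption about how the driver's NVML library is installed on the system.

```python
import ctypes


def has_symbol(library, symbol):
    """Check whether a shared library exports a symbol.

    Returns True if the symbol is found, False if the library loads but
    lacks the symbol, and None if the library itself cannot be loaded.
    """
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        return None
    # ctypes raises AttributeError when the lookup of the symbol fails,
    # so hasattr() reports whether the symbol is exported.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Assumed library name; adjust if NVML is installed under another name.
    found = has_symbol("libnvidia-ml.so.1", "nvmlDeviceGetGpuFabricInfo")
    if found is None:
        print("NVML library could not be loaded")
    elif found:
        print("nvmlDeviceGetGpuFabricInfo is exported")
    else:
        print("nvmlDeviceGetGpuFabricInfo is missing from this NVML library")
```

If the result is "missing", the driver's NVML predates that function, which would match the symbol lookup error above.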
Driver version 470.161.03 corresponds to CUDA toolkit 11.4. Is it possible to upgrade the driver on the target system to at least 525.x.x (which corresponds to CUDA 12.0)? Otherwise, please use UCX built against CUDA 11.
If you build UCX from source, you can also try this patch: https://github.com/openucx/ucx/pull/10680. It should work without changing the driver.
I have already tried this version, but it still reports the error: undefined symbol: nvmlDeviceGetGpuFabricInfo.
Is there any other solution besides reinstalling the driver?
I'm not sure how that is possible with this patch; it should disable the use of NVML in this case.
Could you please attach the logs from a run with UCX_LOG_LEVEL=debug?
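For collecting that log, a small wrapper along these lines may help. It is a sketch only: the helper name `run_with_ucx_debug` and the log file name are hypothetical, and the command passed in should be whatever application reproduces the error.

```python
import os
import subprocess


def run_with_ucx_debug(command, log_path):
    """Run `command` with UCX_LOG_LEVEL=debug and save its output.

    stdout and stderr are both written to `log_path` so the file can be
    attached to the issue. Returns the exit code of the command.
    """
    # Copy the current environment and add the debug log level.
    env = dict(os.environ, UCX_LOG_LEVEL="debug")
    with open(log_path, "w") as log:
        result = subprocess.run(
            command, env=env, stdout=log, stderr=subprocess.STDOUT
        )
    return result.returncode
```

For example, `run_with_ucx_debug(["ucx_info", "-d"], "ucx_debug.log")` would capture the device listing; replace the command with the failing application to capture the log that was requested.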