[fix] export failure with CUDA driver < 526 and pynvml>=11.5.0

Open CoderHam opened this issue 1 year ago • 2 comments

A bug present in older drivers was fixed in the 526 driver release. For drivers older than 526, the recommendation is to downgrade pynvml to 11.4.0 and to use 11.5.0 only with driver 526 or later.

This change uses the legacy pynvml memory usage function, even with pynvml 11.5.0, if the driver version is older than 526.
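For illustration, the fallback described above could look roughly like the sketch below. It is not the code from this PR: `parse_version`, `can_use_memory_v2`, and `device_memory_info` are hypothetical helper names. It also assumes that the installed pynvml exposes `__version__`, and that pynvml 11.5.0 or later accepts `version=pynvml.nvmlMemory_v2` in `nvmlDeviceGetMemoryInfo`. Versions are compared as integer tuples rather than as strings, so that a driver such as `1000.x` would not sort before `526`.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '525.105.17' into a tuple of ints."""
    return tuple(int(number) for number in re.findall(r"\d+", text))


def can_use_memory_v2(pynvml_version: str, driver_version: str) -> bool:
    """Return True only if both pynvml and the driver are new enough for the v2 API."""
    return (
        parse_version(pynvml_version) >= (11, 5, 0)
        and parse_version(driver_version) >= (526,)
    )


def device_memory_info(device_index: int = 0):
    """Query GPU memory info, falling back to the legacy call on old setups.

    Hypothetical helper: requires pynvml and an NVIDIA driver at call time.
    """
    import pynvml  # imported lazily so the helpers above work without a GPU

    pynvml.nvmlInit()
    try:
        driver = pynvml.nvmlSystemGetDriverVersion()
        if isinstance(driver, bytes):  # some pynvml releases return bytes
            driver = driver.decode()
        # Assumption: __version__ is present; treat a missing value as "old".
        library = getattr(pynvml, "__version__", "0")
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        if can_use_memory_v2(library, driver):
            # v2 API: only safe with pynvml >= 11.5.0 and driver >= 526
            return pynvml.nvmlDeviceGetMemoryInfo(
                handle, version=pynvml.nvmlMemory_v2
            )
        # Legacy API: works with older drivers
        return pynvml.nvmlDeviceGetMemoryInfo(handle)
    finally:
        pynvml.nvmlShutdown()
```

Keeping the version check in a pure function means the decision can be tested on a machine without a GPU; only `device_memory_info` touches the driver.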

Mentioned in the issue as well: NVIDIA/TensorRT-LLM#808 (comment)

May 03 '24 00:05 CoderHam

Thanks for addressing the pynvml issue related to the driver version. @CoderHam, may I ask which doc (or link) you referred to in order to determine the driver version (526)?

May 16 '24 03:05 jaedeok-nvidia

@jaedeok-nvidia It took a while to dig through, but I followed the thread from https://forums.developer.nvidia.com/t/nvml-bug-nvmldevicegetcomputerunningprocesses-returns-compute-processes-for-all-gpu-devices/222337/2 and https://github.com/NVIDIA/k8s-device-plugin/issues/331#issuecomment-1498616763

This confirmed that missing symbols in the underlying NVML libraries prevent us from using the v2 API prior to driver 526.

May 16 '24 13:05 CoderHam

Hi @CoderHam, the changes are integrated in https://github.com/NVIDIA/TensorRT-LLM/pull/1688 and we've credited you as co-author, so I'm closing this PR now. Thanks a lot!

May 28 '24 12:05 kaiyux