TensorRT-LLM
ImportError and OSError When Importing tensorrt_llm in Python 3.10 Environment
I'm encountering an issue when trying to import the tensorrt_llm package in my Python environment. I'm using Python 3.10, and the error seems to be related to missing dependencies and shared object files. Below are the details of the error message and my environment.
Environment
Python Version: 3.10
Operating System: Ubuntu
tensorrt_llm Version: 0.7.0
Error Description
First, there's a warning indicating that a required package 'psutil' is not installed. The warning suggests installing 'pynvml>=11.5.0', but it's unclear if this is the correct package or version.
Warning Message:
[01/31/2024-11:19:55] [TRT-LLM] [W] A required package 'psutil' is not installed. Will not monitor the device memory usages. Please install the package first, e.g, 'pip install pynvml>=11.5.0'.
However, psutil is already installed in my environment.
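One way to narrow this down is to check which of the two packages the warning mentions are actually importable in the interpreter that raises it; a common cause of "package not installed" warnings despite a successful install is that the package went into a different environment. This is a hedged sketch, not part of TensorRT-LLM itself:

```python
import importlib.util

# Check both packages named in the warning from the current interpreter's
# point of view; find_spec returns None when the package is not importable.
for name in ("psutil", "pynvml"):
    spec = importlib.util.find_spec(name)
    print(f"{name}: {'installed' if spec else 'missing'}")
```

Running this inside the same environment (and, if applicable, the same container) that produces the warning shows whether the interpreter can actually see psutil and pynvml.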
Following this warning, an OSError occurs, indicating that the libnccl.so.2 shared object file cannot be found.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/miniconda3/envs/rag/lib/python3.10/site-packages/tensorrt_llm/__init__.py", line 61, in <module>
_init(log_level="error")
File "/home/ubuntu/miniconda3/envs/rag/lib/python3.10/site-packages/tensorrt_llm/_common.py", line 47, in _init
_load_plugin_lib()
File "/home/ubuntu/miniconda3/envs/rag/lib/python3.10/site-packages/tensorrt_llm/plugin/plugin.py", line 34, in _load_plugin_lib
handle = ctypes.CDLL(plugin_lib_path(),
File "/home/ubuntu/miniconda3/envs/rag/lib/python3.10/ctypes/__init__.py", line 374, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libnccl.so.2: cannot open shared object file: No such file or directory
Additional Context
Any help or guidance on resolving these issues would be greatly appreciated.
Thank you!
Do you run in the Docker image built by the Dockerfile? If not, could you give it a try? It looks like some packages are missing from your environment.
Thank you for your response.
Yes, I did run in the Docker image built by the Dockerfile provided, and I believe I installed tensorrt_llm successfully in that environment.
However, I then encountered an error indicating that TensorRT does not have the attribute 'int64' (or possibly 'int32'; I apologize, I can't recall exactly). Since that attribute was introduced in TensorRT 9.x and my environment was using 8.x, I updated to TensorRT 9.x to ensure compatibility with tensorrt_llm. After this update, I ran into the issue I originally described.
Regarding re-installing tensorrt_llm: do you think that would resolve this specific problem? Given how time-consuming the process is, I'd greatly appreciate any more efficient solution.
Additionally, I want to highlight that my CUDA version is 11.8. I've taken extra care to ensure all related libraries are compatible with CUDA 11.8, including TensorRT, PyTorch, Xformer, and the other NVIDIA auxiliary libraries. Could this be a potential cause of the error I'm experiencing?
It looks like you did not enter the correct Docker image, or did not build it correctly.
- In the Docker image built by the Dockerfile, TensorRT 9 should already be installed, so you don't need to install it again.
- The issue here is that your program cannot find NCCL, but NCCL is also installed in the Docker image. You should be able to find the NCCL shared library in /usr/lib/x86_64-linux-gnu/ inside the container.
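To confirm the reply above from inside the container, you can list the NCCL libraries at that path and check whether the loader has any extra search directories configured. A small sketch (the path is the one given above; adjust it if your image differs):

```python
import glob
import os

# Directory where the Docker image is expected to provide NCCL,
# per the reply above.
candidates = glob.glob("/usr/lib/x86_64-linux-gnu/libnccl.so*")
print("NCCL libraries found:", candidates if candidates else "none")

# If the library lives somewhere else, the dynamic loader must be told
# where to look, e.g. via LD_LIBRARY_PATH.
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
```

If the glob finds libnccl.so.2 but the import still fails, the loader search path is the likely culprit; if it finds nothing, the image was probably not built from the provided Dockerfile.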