
ImportError and OSError When Importing tensorrt_llm in Python 3.10 Environment

Open XavierSpycy opened this issue 1 year ago • 3 comments

I'm encountering an issue when trying to import the tensorrt_llm package in my Python environment. I'm using Python 3.10, and the error seems to be related to missing dependencies and shared object files. Below are the details of the error message and my environment.

Environment

Python Version: 3.10
Operating System: Ubuntu
tensorrt_llm Version: 0.7.0

Error Description

First, there's a warning saying that the required package 'psutil' is not installed. Confusingly, it then suggests installing a different package, 'pynvml>=11.5.0', so it's unclear which package or version is actually needed.

Warning Message:

[01/31/2024-11:19:55] [TRT-LLM] [W] A required package 'psutil' is not installed. Will not monitor the device memory usages. Please install the package first, e.g, 'pip install pynvml>=11.5.0'.

However, in my environment, psutil has already been installed.
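
Since the warning names 'psutil' but suggests installing 'pynvml', one quick way to narrow this down is to check whether both modules are visible to the interpreter that imports tensorrt_llm. The snippet below is only a sketch: the helper name `missing_modules` is made up for illustration, and it is an assumption that one of these two modules is what the warning actually checks for.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that the current interpreter cannot find."""
    # find_spec returns None when a top-level module is not importable.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: the warning depends on one of these two packages.
    print("Not importable:", missing_modules(["psutil", "pynvml"]))
```

Run it with the same Python binary (here, the one from the conda environment) that raises the warning; an empty list means both modules are importable from that interpreter.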

Following this warning, an OSError occurs, mentioning that the libnccl.so.2 file cannot be found.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/rag/lib/python3.10/site-packages/tensorrt_llm/__init__.py", line 61, in <module>
    _init(log_level="error")
  File "/home/ubuntu/miniconda3/envs/rag/lib/python3.10/site-packages/tensorrt_llm/_common.py", line 47, in _init
    _load_plugin_lib()
  File "/home/ubuntu/miniconda3/envs/rag/lib/python3.10/site-packages/tensorrt_llm/plugin/plugin.py", line 34, in _load_plugin_lib
    handle = ctypes.CDLL(plugin_lib_path(),
  File "/home/ubuntu/miniconda3/envs/rag/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libnccl.so.2: cannot open shared object file: No such file or directory
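
The traceback shows the failure happening inside `ctypes.CDLL`, so the same lookup can be reproduced without importing tensorrt_llm at all. This is a minimal sketch, not part of tensorrt_llm; the helper name `can_load` is hypothetical. It asks the dynamic loader for a library by name and reports the loader's error message if the lookup fails.

```python
import ctypes


def can_load(lib_name):
    """Try to load a shared library by name; return (ok, error_message)."""
    try:
        ctypes.CDLL(lib_name)
        return True, ""
    except OSError as exc:
        # The loader raises OSError when the file cannot be found or opened.
        return False, str(exc)


if __name__ == "__main__":
    ok, message = can_load("libnccl.so.2")
    print("libnccl.so.2 loadable:", ok)
    if not ok:
        print("Loader said:", message)
```

If this reports that the library is not loadable, the problem is with how the environment exposes NCCL to the loader rather than with tensorrt_llm itself.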

Additional Context

Any help or guidance on resolving these issues would be greatly appreciated.

Thank you!

XavierSpycy avatar Jan 31 '24 03:01 XavierSpycy

Are you running in the Docker image built from the provided Dockerfile? If not, could you give it a try? It looks like some packages are missing from your environment.

byshiue avatar Jan 31 '24 09:01 byshiue

Thank you for your response.

Yes, I did run in the Docker image built by the Docker file provided. I believe I've installed tensorrt_llm successfully in this environment.

However, I encountered an error indicating that TensorRT does not have the attribute 'int64' (or 'int32'; I apologize, I can't recall exactly which). I found that this attribute was introduced in TensorRT 9.x, while my environment was using 8.x, so I updated to TensorRT 9.x to ensure compatibility with tensorrt_llm. After this update, I ran into the issue I originally described.

Regarding the re-installation of tensorrt_llm, do you think it would resolve this specific problem? Given the time-consuming nature of this process, I'd greatly appreciate it if there are more efficient solutions.

Additionally, I want to highlight that my CUDA version is 11.8. I've taken extra care to ensure all related libraries are compatible with CUDA 11.8, including TensorRT, PyTorch, xFormers, and other NVIDIA auxiliary libraries. Could this be a potential reason for the error I'm experiencing?

XavierSpycy avatar Jan 31 '24 10:01 XavierSpycy

It looks like you either didn't enter the correct Docker image or didn't build it correctly.

  1. The Docker image built from the Dockerfile already includes TensorRT 9, so you don't need to install it again.
  2. The issue here is that your program cannot find NCCL, but NCCL is also installed in the Docker image. You should be able to find the NCCL shared library in /usr/lib/x86_64-linux-gnu/ inside the container.
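
To confirm the second point from inside the container, a short script can list any NCCL shared libraries in that directory. This is a sketch under stated assumptions: the helper `find_shared_libs` is hypothetical, and the `libnccl.so` filename prefix is inferred from the `libnccl.so.2` name in the error message.

```python
from pathlib import Path


def find_shared_libs(directory, prefix="libnccl.so"):
    """Return sorted file names in `directory` that start with `prefix`."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.iterdir() if p.name.startswith(prefix))


if __name__ == "__main__":
    # Directory taken from the comment above.
    print(find_shared_libs("/usr/lib/x86_64-linux-gnu/"))
```

An empty list suggests the session is not running inside the expected image, or that NCCL lives somewhere else in that environment.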

byshiue avatar Feb 04 '24 09:02 byshiue