TensorRT-LLM
Official Triton Inference Server Image with TRT-LLM support has no TRT-LLM
System Info
I am using A100 GPU 80 GB
Who can help?
@byshiue @ncomly-nvidia
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
So I went to NVIDIA NGC to download the official NVIDIA image that has TensorRT-LLM support, and pulled the image using this command:
docker pull nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3
And once pulled, I ran the image with this command:
docker run -it --gpus all nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 bash
Expected behavior
The expected behavior is that importing tensorrt_llm or running trtllm-build inside the container works. However, that is not the case.
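For example, a minimal check of what I expected to succeed inside the container (the printed version would simply be whatever the image ships):
$ python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
$ trtllm-build --help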
Actual behavior
Here is the output I am getting:
:~$ docker pull nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3
:~$ docker run -it --gpus all nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 bash
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 24.03 (build 86102893)
Triton Server Version 2.44.0
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.3 driver version 545.23.08 with kernel driver version 525.105.17.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.
root@b63ad1046956:/opt/tritonserver# python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import tensorrt
>>> import tensorrt_llm
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'tensorrt_llm'
>>>
>>>
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'torch'
>>>
>>>
Additional notes
I am not sure, but I expected that simply pulling the image and using it would work. Instead, I am blocked when importing the modules.
The Triton server docker images only have the backends installed:
$ docker run -it --gpus all --rm nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 bash
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 24.03 (build 86102893)
Triton Server Version 2.44.0
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
root@67aedfe801bb:/opt/tritonserver# ls -lR backends/
backends/:
total 8
drwxrwxrwx 2 triton-server triton-server 4096 Mar 16 03:58 python
drwxrwxrwx 2 triton-server triton-server 4096 Mar 16 03:57 tensorrtllm
backends/python:
total 2632
-rw-rw-rw- 1 triton-server triton-server 1362472 Mar 16 03:58 libtriton_python.so
-rwxrwxrwx 1 triton-server triton-server 1304768 Mar 16 03:58 triton_python_backend_stub
-rw-rw-rw- 1 triton-server triton-server 21877 Mar 16 03:58 triton_python_backend_utils.py
backends/tensorrtllm:
total 1846324
lrwxrwxrwx 1 triton-server triton-server 35 Mar 16 03:57 libnvinfer_plugin_tensorrt_llm.so -> libnvinfer_plugin_tensorrt_llm.so.9
lrwxrwxrwx 1 triton-server triton-server 39 Mar 16 03:57 libnvinfer_plugin_tensorrt_llm.so.9 -> libnvinfer_plugin_tensorrt_llm.so.9.2.0
-rw-rw-rw- 1 triton-server triton-server 2000936 Mar 16 03:57 libnvinfer_plugin_tensorrt_llm.so.9.2.0
-rw-rw-rw- 1 triton-server triton-server 1888285344 Mar 16 03:57 libtensorrt_llm.so
-rw-rw-rw- 1 triton-server triton-server 339528 Mar 16 03:57 libtriton_tensorrtllm.so
root@67aedfe801bb:/opt/tritonserver#
You need to build the engines separately by installing the necessary dependencies in a different container (or potentially in this container, but I've never attempted that).
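For reference, a rough sketch of what that separate build step might look like (a sketch only: the PyTorch container tag, the Llama example, the model path, and the wheel version are all assumptions, and the installed tensorrt_llm version has to match the backend shipped in the Triton image):
$ docker run -it --gpus all -v $(pwd):/workspace nvcr.io/nvidia/pytorch:24.03-py3 bash
# inside the build container: install the TensorRT-LLM Python package and fetch the example scripts
$ pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com
$ git clone https://github.com/NVIDIA/TensorRT-LLM.git
# convert a Hugging Face checkpoint (convert_checkpoint.py is per-model under examples/)
$ python3 TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir /workspace/model \
      --output_dir /workspace/ckpt --dtype float16
# build the TensorRT engine from the converted checkpoint
$ trtllm-build --checkpoint_dir /workspace/ckpt --output_dir /workspace/engines --gemm_plugin float16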
I see, building the engine through another container is not a problem, but can you help me with how I can run programs like run.py?
The Triton backend image only has the TensorRT-LLM backend; it does not contain the TensorRT-LLM Python package. So you can only run serving on Triton directly. If you want to convert checkpoints or run the TensorRT-LLM Python examples, you need to rebuild TensorRT-LLM and install it.
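For example, once the TensorRT-LLM Python package is installed in a separate container (as sketched above), running the example script could look roughly like this (engine and tokenizer paths are placeholders, not values from this thread):
# run.py lives under examples/ in the TensorRT-LLM repository
$ python3 TensorRT-LLM/examples/run.py --engine_dir /workspace/engines --tokenizer_dir /workspace/model \
      --input_text "Hello, my name is" --max_output_len 64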
I see, and I understand that the reason for doing this is to keep the image smaller. However, is it possible to have an image specific to the TensorRT-LLM Python package?
You can install tensorrt_llm via pip now, but we are still improving the handling of version mismatches.
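That is, something along these lines (the 0.8.0 pin is only a guess based on the TensorRT 9.2 plugin listed above, not a version confirmed in this thread; it has to match the backend inside the Triton image you deploy with):
$ pip3 install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com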
I see, got it, thanks, will do
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.
hi @byshiue
Can you also serve an LLM with the following docker image: nvcr.io/nvidia/tritonserver:23.12-py3?
I am able to create the container for the image and get inside using:
docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v </path/to/tensorrtllm_backend>:/tensorrtllm_backend \
    -v </path/to/engines>:/engines \
    nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
Whereas if I create the container using the docker-compose file, the container is not created and the error says the driver version is incompatible.
I understand the issue here, but it works in interactive mode.
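For comparison, a docker-compose service that requests the GPUs the same way as the docker run command above might look like this (a sketch only; the service name and mount paths are placeholders, and this is not the compose file used in this thread):
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
    network_mode: host
    shm_size: 2g
    ulimits:
      memlock: -1
      stack: 67108864
    volumes:
      - /path/to/tensorrtllm_backend:/tensorrtllm_backend
      - /path/to/engines:/engines
    # compose equivalent of "--gpus all"; leaving this out is a common cause of driver/CUDA errors
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    stdin_open: true
    tty: true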