
Official Triton Inference Server Image with TRT-LLM support has no TRT-LLM

Anindyadeep opened this issue 1 year ago • 7 comments

System Info

I am using A100 GPU 80 GB

Who can help?

@byshiue @ncomly-nvidia

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

So I went to this website to download the official NVIDIA image that has support for TensorRT-LLM. I pulled the image with this command:

docker pull nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3

And once pulled, I ran the image with this command:

docker run -it --gpus all nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 bash

Expected behavior

The expected behavior is that importing tensorrt_llm or using trtllm-build inside the container works. However, that is not happening.

Actual behavior

Here is the output I am getting:

:~$ docker pull nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3
:~$ docker run -it --gpus all nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 bash

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 24.03 (build 86102893)
Triton Server Version 2.44.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.3 driver version 545.23.08 with kernel driver version 525.105.17.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

root@b63ad1046956:/opt/tritonserver# python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import tensorrt
>>> import tensorrt_llm
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'tensorrt_llm'
>>> 
>>> 
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'torch'
>>> 
>>> 

additional notes

I am not sure, but I expected that simply pulling the image and using it would work. Instead, I am blocked at importing the modules.

Anindyadeep avatar Apr 07 '24 14:04 Anindyadeep

The Triton server docker images only have the backends installed:

$ docker run -it --gpus all --rm nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 bash

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 24.03 (build 86102893)
Triton Server Version 2.44.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

root@67aedfe801bb:/opt/tritonserver# ls -lR backends/
backends/:
total 8
drwxrwxrwx 2 triton-server triton-server 4096 Mar 16 03:58 python
drwxrwxrwx 2 triton-server triton-server 4096 Mar 16 03:57 tensorrtllm

backends/python:
total 2632
-rw-rw-rw- 1 triton-server triton-server 1362472 Mar 16 03:58 libtriton_python.so
-rwxrwxrwx 1 triton-server triton-server 1304768 Mar 16 03:58 triton_python_backend_stub
-rw-rw-rw- 1 triton-server triton-server   21877 Mar 16 03:58 triton_python_backend_utils.py

backends/tensorrtllm:
total 1846324
lrwxrwxrwx 1 triton-server triton-server         35 Mar 16 03:57 libnvinfer_plugin_tensorrt_llm.so -> libnvinfer_plugin_tensorrt_llm.so.9
lrwxrwxrwx 1 triton-server triton-server         39 Mar 16 03:57 libnvinfer_plugin_tensorrt_llm.so.9 -> libnvinfer_plugin_tensorrt_llm.so.9.2.0
-rw-rw-rw- 1 triton-server triton-server    2000936 Mar 16 03:57 libnvinfer_plugin_tensorrt_llm.so.9.2.0
-rw-rw-rw- 1 triton-server triton-server 1888285344 Mar 16 03:57 libtensorrt_llm.so
-rw-rw-rw- 1 triton-server triton-server     339528 Mar 16 03:57 libtriton_tensorrtllm.so
root@67aedfe801bb:/opt/tritonserver#

You need to build the engines separately, by installing the necessary dependencies in a different container (or potentially in this container, but I've never attempted that).
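
Something along these lines should work for the separate-container route, though I haven't tried it myself; the base image tag, mount paths, and checkpoint location below are placeholders, not something from this image:

# untested sketch: use any CUDA-capable development image and mount your model/engine dirs
docker run -it --gpus all \
    -v /path/to/models:/models \
    -v /path/to/engines:/engines \
    nvcr.io/nvidia/pytorch:24.03-py3 bash

# inside that container, install the TensorRT-LLM wheel so trtllm-build becomes available
pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com

Engines written to /engines can then be mounted into the Triton container and referenced from the model repository.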

kristiankielhofner avatar Apr 07 '24 17:04 kristiankielhofner

I see, building the engine through another container is not a problem, but can you help me with how to run programs like run.py?

Anindyadeep avatar Apr 07 '24 18:04 Anindyadeep

The Triton backend image only contains the TensorRT-LLM backend; it does not contain the TensorRT-LLM Python package. So you can only run serving on Triton directly. If you want to convert a checkpoint or run the TensorRT-LLM Python examples, you need to build TensorRT-LLM and install it yourself.
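
Roughly, the flow looks like this; the LLaMA example, paths, and flag values below are placeholders, and the exact script locations and flags depend on the TensorRT-LLM release you install:

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama

# convert the Hugging Face checkpoint to the TensorRT-LLM checkpoint format
python3 convert_checkpoint.py --model_dir /models/llama-hf --output_dir /tmp/trtllm-ckpt --dtype float16

# build the engine from the converted checkpoint
trtllm-build --checkpoint_dir /tmp/trtllm-ckpt --output_dir /engines/llama

# run the example script against the built engine
python3 ../run.py --engine_dir /engines/llama --tokenizer_dir /models/llama-hf --max_output_len 64 --input_text "Hello"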

byshiue avatar Apr 09 '24 08:04 byshiue

I see, and I understand that the reason for doing this is to keep the image smaller. However, is it possible to have an image specific to the TensorRT-LLM Python package?

Anindyadeep avatar Apr 09 '24 08:04 Anindyadeep

You can install tensorrt_llm via pip now, but we are still improving the version-mismatch issue.
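
The wheels are hosted on NVIDIA's package index, so the install is roughly the following; pinning a tensorrt_llm release that matches the backend shipped in your Triton container helps avoid the mismatch:

pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com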

byshiue avatar Apr 10 '24 09:04 byshiue

I see, got it, thanks, will do

Anindyadeep avatar Apr 11 '24 02:04 Anindyadeep

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] avatar May 16 '24 01:05 github-actions[bot]

This issue was closed because it has been stalled for 15 days with no activity.

github-actions[bot] avatar May 31 '24 01:05 github-actions[bot]

hi @byshiue

Can you also serve an LLM with the following docker image: nvcr.io/nvidia/tritonserver:23.12-py3?

geraldstanje avatar Jun 26 '24 15:06 geraldstanje

I am able to create the container from the image and get inside using:

docker run --rm -it --net host --shm-size=2g \
    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
    -v </path/to/tensorrtllm_backend>:/tensorrtllm_backend \
    -v </path/to/engines>:/engines \
    nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3

whereas if I create the container using the docker-compose file, the container is not created and the error says the driver version is incompatible.

I understand the issue here, but it is working in interactive mode.
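
For what it's worth, the compose equivalent of --gpus all is a device reservation on the service; a rough, untested fragment (image tag and other settings as in the docker run above):

# docker-compose sketch: without this reservation the container may not see the GPU driver
services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
    shm_size: 2gb
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]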

boggumaheshbabu avatar Jan 10 '25 05:01 boggumaheshbabu