
Add TensorRT-LLM support

oobabooga opened this issue 11 months ago • 4 comments

TensorRT-LLM (https://github.com/NVIDIA/TensorRT-LLM/) is a new inference backend developed by NVIDIA.

  • It only works on NVIDIA GPUs.
  • It supports several quantization methods (GPTQ, AWQ, FP8, SmoothQuant), as well as 16-bit inference.

In my testing, I found it to be consistently faster than ExLlamaV2 in both prompt processing and token generation. That makes it the new SOTA inference backend in terms of speed.

Speed tests

| Model | Precision | Backend | Prompt processing (3200 tokens, t/s) | Generation (512 tokens, t/s) |
|---|---|---|---|---|
| TheBloke/Llama-2-7B-GPTQ | 4-bit | TRT-LLM + ModelRunnerCpp | 8014.99 | 138.84 |
| TheBloke/Llama-2-7B-GPTQ | 4-bit | TRT-LLM + ModelRunner | 7572.49 | 125.45 |
| TheBloke/Llama-2-7B-GPTQ | 4-bit | ExLlamaV2 | 6138.32 | 130.16 |
| TheBloke/Llama-2-13B-GPTQ | 4-bit | TRT-LLM + ModelRunnerCpp | 4553.43 | 80.69 |
| TheBloke/Llama-2-13B-GPTQ | 4-bit | TRT-LLM + ModelRunner | 4161.57 | 75.80 |
| TheBloke/Llama-2-13B-GPTQ | 4-bit | ExLlamaV2 | 3499.26 | 75.28 |
| NousResearch_Llama-2-7b-hf | 16-bit | TRT-LLM + ModelRunnerCpp | 8465.27 | 55.54 |
| NousResearch_Llama-2-7b-hf | 16-bit | TRT-LLM + ModelRunner | 7676.80 | 53.33 |
| NousResearch_Llama-2-7b-hf | 16-bit | ExLlamaV2 | 6511.87 | 53.02 |
| NousResearch_Llama-2-13b-hf | 16-bit | TRT-LLM + ModelRunnerCpp | 4621.76 | 29.95 |
| NousResearch_Llama-2-13b-hf | 16-bit | TRT-LLM + ModelRunner | 4299.16 | 29.22 |
| NousResearch_Llama-2-13b-hf | 16-bit | ExLlamaV2 | 3881.43 | 29.11 |

I provided the models with a 3200-token input and measured the time to process those 3200 tokens and then the time to generate 512 tokens. I did this over the API, and each number in the table above is the median of 20 measurements.

To measure the TensorRT-LLM speeds accurately, it was necessary to do a warmup generation before starting the measurements, as the first generation has an overhead due to module imports. The same warmup was done for ExLlamaV2 as well.

The tests were carried out on an RTX 6000 Ada GPU.
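
For reference, here is a minimal sketch of how such a timing loop can be written, assuming the OpenAI-compatible API extension is listening on port 5000. The endpoint, prompt construction, and payload below are illustrative and not my exact test script:

# Rough timing loop over the API (endpoint/port are assumptions for illustration).
import statistics
import time

import requests

URL = "http://127.0.0.1:5000/v1/completions"
PROMPT = "word " * 3200         # stand-in for a ~3200-token prompt
MAX_NEW_TOKENS = 512
N_RUNS = 20

# Warmup generation: the first request pays a one-time import/setup overhead.
requests.post(URL, json={"prompt": PROMPT, "max_tokens": 16})

times = []
for _ in range(N_RUNS):
    start = time.time()
    requests.post(URL, json={"prompt": PROMPT, "max_tokens": MAX_NEW_TOKENS})
    times.append(time.time() - start)

# Median end-to-end time; splitting prompt processing from generation needs
# per-phase timing from the backend itself.
print(f"median request time: {statistics.median(times):.2f} s")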

Installation

Option 1: Docker

Just use the included Dockerfile under docker/TensorRT-LLM/Dockerfile, which will automatically set everything up from scratch.

I find the following commands useful (make sure to run them after moving into the folder containing the Dockerfile with cd):

# Build the image
docker build -t mylocalimage:debug .

# Run the container mapping port 7860 from the host to port 7860 in the container
docker run -p 7860:7860 mylocalimage:debug

# Run the container with GPU support
docker run -p 7860:7860 --gpus all mylocalimage:debug

# Run the container interactively (-it), spawning a Bash shell (/bin/bash) within the container
docker run -p 7860:7860 -it mylocalimage:debug /bin/bash

Option 2: Manually

TensorRT-LLM only works on Python 3.10 at the moment, while this project uses Python 3.11 by default, so it's necessary to create a separate Python 3.10 conda environment:

# Install system-wide TensorRT-LLM requirements
sudo apt-get -y install openmpi-bin libopenmpi-dev

# Create a Python 3.10 environment
conda create -n tensorrt python=3.10
conda activate tensorrt

# Install text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui/
cd text-generation-webui
pip install -r requirements.txt
pip uninstall -y flash_attn  # Incompatible with the PyTorch version installed by TensorRT-LLM

# This is needed to avoid an error about "Failed to build mpi4py" in the next command
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

# Install TensorRT-LLM
pip3 install tensorrt_llm==0.9.0.dev2024030500 -U --pre --extra-index-url https://pypi.nvidia.com

Make sure to run the commands above in the order listed.
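
Once everything is installed, a quick sanity check can be run from inside the tensorrt environment. This assumes the package exposes __version__, as recent TensorRT-LLM wheels do:

# Verify that the tensorrt_llm wheel imports correctly.
import tensorrt_llm

print(tensorrt_llm.__version__)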

For Windows setup and more information about installation, consult the official README.

Converting a model

Unlike with other backends, it's necessary to convert the model before using it so that it gets optimized for your GPU (or GPUs). These are the commands I have used:

FP16 models
#!/bin/bash

CHECKPOINT_DIR=/home/me/text-generation-webui/models/NousResearch_Llama-2-7b-hf

cd /home/me/TensorRT-LLM/

python examples/llama/convert_checkpoint.py \
    --model_dir $CHECKPOINT_DIR \
    --output_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --dtype float16

trtllm-build \
    --checkpoint_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --output_dir ${CHECKPOINT_DIR}_TensorRT \
    --gemm_plugin float16 \
    --max_input_len 4096 \
    --max_output_len 512

# Copy the tokenizer files
cp ${CHECKPOINT_DIR}/{tokenizer*,special_tokens_map.json} ${CHECKPOINT_DIR}_TensorRT

GPTQ models
#!/bin/bash

CHECKPOINT_DIR=/home/me/text-generation-webui/models/NousResearch_Llama-2-7b-hf
QUANTIZED_DIR=/home/me/text-generation-webui/models/TheBloke_Llama-2-7B-GPTQ
QUANTIZED_FILE="model.safetensors"

cd /home/me/TensorRT-LLM/

python examples/llama/convert_checkpoint.py \
    --model_dir $CHECKPOINT_DIR \
    --output_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --dtype float16 \
    --ammo_quant_ckpt_path "${QUANTIZED_DIR}/$QUANTIZED_FILE" \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group

trtllm-build \
    --checkpoint_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --output_dir ${QUANTIZED_DIR}_TensorRT \
    --gemm_plugin float16 \
    --max_input_len 4096 \
    --max_output_len 512

# Copy the tokenizer files
cp ${CHECKPOINT_DIR}/{tokenizer*,special_tokens_map.json} ${QUANTIZED_DIR}_TensorRT

More commands can be found on this page:

https://github.com/NVIDIA/TensorRT-LLM/tree/728cc0044bb76d1fafbcaa720c403e8de4f81906/examples/llama

Make sure to use this commit of TensorRT-LLM for the commands above to work:

git clone https://github.com/NVIDIA/TensorRT-LLM/
cd TensorRT-LLM
git checkout 728cc0044bb76d1fafbcaa720c403e8de4f81906

They will generate folders named like this, containing both the converted model and a copy of the tokenizer files:

NousResearch_Llama-2-7b-hf_TensorRT
NousResearch_Llama-2-13b-hf_TensorRT
TheBloke_Llama-2-7B-GPTQ_TensorRT
TheBloke_Llama-2-13B-GPTQ_TensorRT
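
If you want to double-check a converted folder before loading it, something like the following works. This is a hypothetical helper, not part of the project; the expected file names (*.engine and config.json from trtllm-build, plus the copied tokenizer files) are assumptions and may differ between versions:

# Hypothetical sanity check for a converted TensorRT-LLM folder.
from pathlib import Path

def check_trt_folder(path: str) -> None:
    p = Path(path)
    engines = list(p.glob("*.engine"))
    assert engines, f"no .engine files found in {p}"
    assert (p / "config.json").exists(), f"missing config.json in {p}"
    assert list(p.glob("tokenizer*")), f"tokenizer files were not copied into {p}"
    print(f"{p.name}: {len(engines)} engine file(s), tokenizer present")

check_trt_folder("models/TheBloke_Llama-2-7B-GPTQ_TensorRT")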

Loading a model

Here is an example:

python server.py \
  --model TheBloke_Llama-2-7B-GPTQ_TensorRT \
  --loader TensorRT-LLM \
  --max_seq_len 4096

Details

  • In the conversion phase, it is necessary to set a fixed value for max_new_tokens (the --max_output_len argument passed to trtllm-build above), which can't be changed at generation time. In my commands above, I set it to 512, and I recommend using this value.
  • There are two ways to load the model: with a class called ModelRunnerCpp or with another one called ModelRunner. The first is faster, but it does not support streaming yet. You can use it with the --cpp-runner flag; a minimal ModelRunner loading sketch follows below.
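
For illustration, here is a minimal sketch of driving an engine with ModelRunner directly, loosely based on examples/run.py from the TensorRT-LLM repository. The paths and sampling parameters are placeholders, and exact argument names may vary between versions; the webui's TensorRT-LLM loader wraps this same interface:

# Minimal ModelRunner sketch (based on TensorRT-LLM's examples/run.py).
from tensorrt_llm.runtime import ModelRunner
from transformers import AutoTokenizer

engine_dir = "models/TheBloke_Llama-2-7B-GPTQ_TensorRT"  # output of trtllm-build
tokenizer = AutoTokenizer.from_pretrained(engine_dir)    # tokenizer files were copied here

runner = ModelRunner.from_dir(engine_dir=engine_dir)

input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids.int()
output_ids = runner.generate(
    batch_input_ids=[input_ids[0]],           # list of 1-D int32 token tensors
    max_new_tokens=200,
    end_id=tokenizer.eos_token_id,
    pad_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
)

# The output has shape (batch, beams, seq_len) and includes the prompt tokens.
print(tokenizer.decode(output_ids[0][0], skip_special_tokens=True))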

TODO

  • [ ] Figure out prefix matching. This is already implemented, but there is no clear documentation on how to use it -- see issues #1043 and #620.
  • [ ] Create a TensorRT-LLM_HF loader integrated with the existing sampling functions in the project.

oobabooga avatar Mar 17 '24 15:03 oobabooga

Heh, I was able to keep flash attention installed with torch 2.2.1. I had tried TRT on the SD side and the hassle wasn't worth it. I wonder how this does with multi-GPU inference. Not using flash attention also probably really balloons memory use for the context. Will be fun to find out.

Ph0rk0z avatar Mar 17 '24 18:03 Ph0rk0z

TensorRT and Triton Inference Server can reserve memory across several video cards at once and respond to several users in parallel. Is it possible to bring this functionality to text-generation-webui?

Nurgl avatar Apr 25 '24 12:04 Nurgl

> TensorRT and Triton Inference Server can reserve memory across several video cards at once and respond to several users in parallel. Is it possible to bring this functionality to text-generation-webui?

It should be possible -- the first step would be to remove the semaphore from modules/text_generation.py and figure out how to connect things together, maybe with a command-line flag for the maximum number of concurrent users. A PR with that addition would be welcome.
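
A hypothetical sketch of that direction (not actual project code): size a counting semaphore from a new flag, with the flag name made up here, and use it where the current exclusive lock is used:

# Hypothetical illustration of a configurable concurrency limit.
import argparse
import threading

parser = argparse.ArgumentParser()
parser.add_argument("--max-concurrent-users", type=int, default=1)
args = parser.parse_args()

# default=1 keeps today's behavior (one generation at a time); higher values
# let that many generations run in parallel on backends that support it.
generation_semaphore = threading.BoundedSemaphore(args.max_concurrent_users)

def generate_reply(prompt):
    with generation_semaphore:
        ...  # call the selected backend's generate function here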

oobabooga avatar Apr 25 '24 22:04 oobabooga

It would be nice to have both a queue mode and a parallel processing mode.

Nurgl avatar May 02 '24 06:05 Nurgl