TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently...
I'm having problems when using MPT. Setting: AWS g5.48xlarge, CUDA 12.1.0, Ubuntu 22.04, Python 3.10, PyTorch 2.1.2. ``` root@7f51eddb66f5:/TensorRT-LLM/examples/mpt# trtllm-build --checkpoint_dir=./ft_ckpts/mpt-7b/fp16 \ --max_batch_size 32 \ --max_input_len 1024 \ --max_output_len 512 \...
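For context, the MPT example's engine build normally follows a convert-then-build flow. A hedged sketch is below: the checkpoint path mirrors the issue, but the `convert_checkpoint.py` flags and the `--output_dir` value are assumptions, not the issue's exact commands.

```shell
# 1) Convert the Hugging Face checkpoint into TensorRT-LLM checkpoint format
#    (script lives in the per-model example directory; flags may vary by version).
python convert_checkpoint.py --model_dir ./mpt-7b --dtype float16 \
    --output_dir ./ft_ckpts/mpt-7b/fp16

# 2) Compile the converted checkpoint into a TensorRT engine.
trtllm-build --checkpoint_dir ./ft_ckpts/mpt-7b/fp16 \
    --max_batch_size 32 --max_input_len 1024 --max_output_len 512 \
    --output_dir ./engines/mpt-7b
```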
Device: Win 11; RTX 4090 When I run: `make -C docker release_build` it fails with the error below. ``` make: Entering directory '/home/mustapham/TensorRT-LLM/docker' Building docker image: tensorrt_llm/release:latest DOCKER_BUILDKIT=1 docker build...
### System Info Any ### Who can help? _No response_ ### Information - [X] The official example scripts - [ ] My own modified scripts ### Tasks - [ ]...
Hi, while trying to run this ``` python build.py --model_dir $model_dir$ \ --dtype float16 \ --use_gpt_attention_plugin float16 \ --use_gemm_plugin float16 \ --max_batch_size 4 \ --max_input_len 128 \ --max_output_len 128 ```...
A single GPU is OK, but the system hangs when I use multiple GPUs. Can someone help solve this? Thanks. python build.py --model_dir meta-llama/Llama-2-7b-chat-hf \ --dtype float16 \ --remove_input_padding \ --use_gpt_attention_plugin float16 \...
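Not a fix for the hang, but to illustrate what multi-GPU tensor parallelism does conceptually: with `tp_size=2`, a weight matrix is split column-wise across ranks, each rank computes a partial matmul, and the shards are gathered back. This is a pure-Python sketch, not TensorRT-LLM's actual implementation.

```python
def matmul(x, w):
    # x: list of rows, w: list of rows; returns x @ w.
    cols = len(w[0])
    return [[sum(xr[k] * w[k][j] for k in range(len(w))) for j in range(cols)]
            for xr in x]

def split_columns(w, parts):
    # Split the weight matrix into `parts` equal column blocks (one per rank).
    n = len(w[0]) // parts
    return [[row[i * n:(i + 1) * n] for row in w] for i in range(parts)]

x = [[1.0, 2.0]]                       # one input row
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]             # 2x4 weight

full = matmul(x, w)
shards = split_columns(w, 2)           # tp_size = 2
partials = [matmul(x, s) for s in shards]
combined = [partials[0][0] + partials[1][0]]   # gather along columns
assert combined == full                # sharded result matches the full matmul
```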
Trying out T5 with the Python backend. https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/enc_dec/run.py#L484 I see that SamplingConfig has output_log_probs https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/generation.py#L355. But the return dict does not include the log probabilities https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/runtime/generation.py#L2515. Is there any other way...
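Until the runtime returns log probs for this path, one generic workaround (independent of TensorRT-LLM, assuming you can expose per-step logits from your run script) is to compute token log probabilities yourself with a numerically stable log-softmax:

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one logit vector.
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def token_log_probs(step_logits, token_ids):
    # Log probability of each generated token, given that step's logits.
    return [log_softmax(l)[t] for l, t in zip(step_logits, token_ids)]

steps = [[2.0, 1.0, 0.1], [0.5, 3.0, 0.2]]  # toy logits: 2 steps, vocab of 3
tokens = [0, 1]                              # chosen token ids per step
lps = token_log_probs(steps, tokens)
```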
Here is my build command. ``` python build.py --model_dir Yi-34B-Chat --dtype float16 --remove_input_padding --use_gemm_plugin float16 --use_gpt_attention_plugin float16 --world_size 2 --tp_size 2 --enable_context_fmha --use_inflight_batching --paged_kv_cache --load_by_shard --use_weight_only --weight_only_precision int4 --output_dir /app/triton_model/tensorrt_llm/1...
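For readers unfamiliar with `--use_weight_only --weight_only_precision int4`: in principle, weights are quantized per channel to 4-bit integers with a floating-point scale and dequantized on the fly at matmul time. This pure-Python sketch shows the idea only; it is not TensorRT-LLM's actual kernel.

```python
def quantize_int4(channel):
    # Symmetric per-channel quantization to the int4 range [-8, 7].
    scale = max(abs(v) for v in channel) / 7.0
    q = [max(-8, min(7, round(v / scale))) for v in channel]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from int4 values and the scale.
    return [v * scale for v in q]

channel = [0.02, -0.14, 0.07, 0.10]
q, scale = quantize_int4(channel)
recovered = dequantize(q, scale)
# In-range weights are recovered to within half a quantization step.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(channel, recovered))
```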
### System Info CPU x86_64 GPU NVIDIA A10 TensorRT branch: main commit id: cad22332550eef9be579e767beb7d605dd96d6f3 CUDA: NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 ### Who can help? Quantization: @Tracin ### Information...
### System Info H800 80G ### Who can help? _No response_ ### Information - [x] The official example scripts - [ ] My own modified scripts ### Tasks - [x]...