TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficientl...

937 TensorRT-LLM issues

### System Info GPU (a10g). I have tried with an AWS g5.2xlarge instance and AWS g5.12xlarge instance. ### Who can help? @byshiue ### Information - [X] The official example scripts...

bug

### System Info H20 * 1 ### Who can help? _No response_ ### Information - [ ] The official example scripts - [ ] My own modified scripts ### Tasks...

question
triaged
need more info

### System Info GPU: NVIDIA A100 Driver Version: 545.23.08 CUDA: 12.3 versions: - https://github.com/NVIDIA/TensorRT-LLM.git (71d8d4d) - https://github.com/triton-inference-server/tensorrtllm_backend.git (bf5e900) Model: zephyr-7b-beta ### Who can help? @kaiyux @byshiue ### Information - [X]...

bug

### System Info llama3 released https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6 https://github.com/meta-llama/llama3 ### Who can help? @ncomly-nvidia ### Information - [ ] The official example scripts - [ ] My own modified scripts ### Tasks...

bug
triaged
feature request

### System Info CPU architecture: x86_64 Host RAM: 1TB GPU: 8xH100 SXM Container: Manually built container with TRT 9.3 Dockerfile.trt_llm_backend (nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 doesn't work for TRT LLM main branch?) TRT LLM...

bug
triaged

python3 convert_checkpoint.py --model_dir /workspace/lk/model/Qwen/14B --output_dir ./tllm_checkpoint_1gpu_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq --per_group
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
0.10.0.dev2024042300
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02

triaged

### System Info p4de (4 80GB A100 GPUs) ### Who can help? @Tracin @byshiue ### Information - [X] The official example scripts - [ ] My own modified scripts ###...

bug

Before the attention operation, the q, k, and v tensors are packed into one large tensor, `qkv`. I would like to perform some in-place operations on q and k only. Currently what I...

python convert_checkpoint.py --model_dir /workspace/lk/model/Qwen/14B/ --output_dir ./tllm_checkpoint_1gpu_fp16_wq --dtype float16 --use_weight_only --weight_only_precision int8
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
0.10.0.dev2024042300
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:02

triaged