TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently...
### System Info

GPU: A10G. I have tried with an AWS g5.2xlarge instance and an AWS g5.12xlarge instance.

### Who can help?

@byshiue

### Information

- [X] The official example scripts...
### System Info

H20 * 1

### Who can help?

_No response_

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks...
### System Info

GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3

Versions:
- https://github.com/NVIDIA/TensorRT-LLM.git (71d8d4d)
- https://github.com/triton-inference-server/tensorrtllm_backend.git (bf5e900)

Model: zephyr-7b-beta

### Who can help?

@kaiyux @byshiue

### Information

- [X]...
### System Info

Llama 3 has been released:
- https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6
- https://github.com/meta-llama/llama3

### Who can help?

@ncomly-nvidia

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks...
### System Info

CPU architecture: x86_64
Host RAM: 1 TB
GPU: 8x H100 SXM
Container: manually built with TRT 9.3 via Dockerfile.trt_llm_backend (nvcr.io/nvidia/tritonserver:24.03-trtllm-python-py3 doesn't work for the TRT-LLM main branch?)
TRT LLM...
```shell
python3 convert_checkpoint.py --model_dir /workspace/lk/model/Qwen/14B \
    --output_dir ./tllm_checkpoint_1gpu_gptq --dtype float16 \
    --use_weight_only --weight_only_precision int4_gptq --per_group
```

```
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
0.10.0.dev2024042300
Loading checkpoint shards: 100%|██████████| 8/8 [00:02
```
Update some dead links
### System Info

p4de (4x 80 GB A100 GPUs)

### Who can help?

@Tracin @byshiue

### Information

- [X] The official example scripts
- [ ] My own modified scripts

###...
Before the attention operation, the q, k, and v tensors are packed into one fused tensor `qkv`. I would like to perform some in-place operations on q and k only. Currently what I...
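One way to modify only the q and k sections of a fused tensor in place is to take slice views into it, so that writes through the views mutate the underlying buffer. The following is a minimal NumPy sketch of that idea; the shapes, the scaling operations, and the layout `[q | k | v]` along the last axis are illustrative assumptions, not taken from TensorRT-LLM internals.

```python
import numpy as np

# Hypothetical sizes for illustration only.
num_tokens, num_heads, head_dim = 4, 2, 8
hidden = num_heads * head_dim

# Fused qkv tensor, assumed laid out as [q | k | v] along the last axis.
qkv = np.ones((num_tokens, 3 * hidden), dtype=np.float32)

# NumPy basic slicing returns views, not copies: writing through
# q or k mutates qkv in place, while the v section stays untouched.
q = qkv[:, 0 * hidden:1 * hidden]
k = qkv[:, 1 * hidden:2 * hidden]
v = qkv[:, 2 * hidden:3 * hidden]

q *= 2.0   # in-place operation on the q section only
k += 1.0   # in-place operation on the k section only

assert qkv[0, 0] == 2.0           # q section was modified through the view
assert qkv[0, hidden] == 2.0      # k section was modified (1 + 1)
assert qkv[0, 2 * hidden] == 1.0  # v section is unchanged
```

The same pattern carries over to frameworks with view semantics (e.g. tensor slicing in PyTorch), as long as the operations applied are themselves in-place.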
```shell
python convert_checkpoint.py --model_dir /workspace/lk/model/Qwen/14B/ \
    --output_dir ./tllm_checkpoint_1gpu_fp16_wq --dtype float16 \
    --use_weight_only --weight_only_precision int8
```

```
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
0.10.0.dev2024042300
Loading checkpoint shards: 100%|██████████| 8/8 [00:02
```