TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficientl...
### System Info [TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600 [02/16/2024-22:04:57] [TRT-LLM] [I] Loading engine from ./plan/visual_encoder/visual_encoder_fp16.plan [02/16/2024-22:05:00] [TRT-LLM] [I] Creating session from engine ./plan/visual_encoder/visual_encoder_fp16.plan [02/16/2024-22:05:00] [TRT] [I] Loaded engine size: 3714 MiB [02/16/2024-22:05:00]...
As the title suggests, this PR removes TP (tensor parallelism) for the MoE router. Duplicating the router across GPUs removes one allreduce per MoE layer. This small change leads to **4-18%...
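To illustrate why a replicated router needs no allreduce, here is a minimal plain-Python sketch. It is not TensorRT-LLM code: the sizes, the `matmul` helper, and the assumption that the tensor-parallel router was sharded along the hidden dimension are all hypothetical, chosen only to show the idea. Each simulated rank computes partial router logits from its slice, and the sum over ranks stands in for the allreduce; the replicated router produces the same logits with a single local product.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for a_row in a]


random.seed(0)
# Illustrative sizes only (assumptions, not taken from the PR).
tp_size, hidden, num_experts, tokens = 4, 16, 8, 5
shard = hidden // tp_size

x = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(tokens)]
w_router = [[random.gauss(0, 1) for _ in range(num_experts)]
            for _ in range(hidden)]

# Sharded router (assumed layout): each rank owns a slice of the hidden
# dimension and can only compute partial logits.
partials = []
for rank in range(tp_size):
    lo, hi = rank * shard, (rank + 1) * shard
    x_slice = [row[lo:hi] for row in x]
    w_slice = w_router[lo:hi]
    partials.append(matmul(x_slice, w_slice))

# Summing the partial logits over ranks simulates the allreduce.
logits_tp = [[sum(p[t][e] for p in partials) for e in range(num_experts)]
             for t in range(tokens)]

# Replicated router: every rank holds the full weight, so the logits are
# computed locally with no cross-rank communication.
logits_replicated = matmul(x, w_router)

same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
           for row_a, row_b in zip(logits_tp, logits_replicated)
           for a, b in zip(row_a, row_b))
print("sharded + allreduce matches replicated:", same)
```

The trade-off in this sketch is that each rank stores the full router weight, which is small (hidden size times number of experts) compared with the expert weights themselves.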
https://github.com/NVIDIA/TensorRT-LLM/blob/3d56a445e8ebf888e78be638faf6beec0a78f3c2/cpp/tensorrt_llm/runtime/worldConfig.cpp#L74 Hi, I've run into a small bug with the CPP implementation of the runtime code. I am running multi-node inference on Llama2 with pipeline parallelism 2 and tensor parallelism...
## Summary Add weight-only quantization for T5. I've added this to the path loading from binary weights. I do not think the HF weight loading currently works, so I have...
### System Info - System independent. This issue is re: docs - In the benchmarking page there are multiple references to build.py scripts that don't exist as far as I...
To reflect the correct usage, as I understand it, when you have elevated privileges. The suggested change worked for me and the original didn't. Also: looking at the Makefile my...
### System Info - RTX 4090 - x86_64 GNU/Linux - main branch ### Who can help? _No response_ ### Information - [X] The official example scripts - [ ] My...
### System Info - CPU architecture: x64 - GPU: RTX 4090 24G - CUDA 12.2 ### Who can help? @byshiue @nc ### Information - [X] The official example scripts -...
https://github.com/THUDM/CogVLM CogVLM is one of the best models for describing images, much better than Qwen-VL in my experience. Making image captioning faster would be a huge gain. Being...
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600 [TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600 [TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024020600 free(): invalid pointer [95e079756bc2:03949] *** Process received signal *** [95e079756bc2:03949] Signal: Aborted (6) [95e079756bc2:03949] Signal code: (-6) [95e079756bc2:03949] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f754a216520]...