TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
# [feat] Enable KV cache to be reused during request generation
Issue: [issues/3733](https://github.com/NVIDIA/TensorRT-LLM/issues/3733)
## Description
This PR enhances the KV cache reuse logic...
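To make the idea of KV cache reuse concrete, here is a hypothetical sketch of block-level prefix reuse (this is an illustration of the general technique, not TensorRT-LLM's actual implementation; `BlockKVCache`, `BLOCK_SIZE`, and `match_and_insert` are invented names): prompt tokens are split into fixed-size blocks, each block is keyed by a hash of the whole prefix ending at that block, and a new request reuses any cached prefix blocks it shares with earlier requests.

```python
# Hypothetical sketch of block-level KV cache reuse (invented names,
# not TensorRT-LLM's real data structures).

BLOCK_SIZE = 4  # tokens per cache block

class BlockKVCache:
    def __init__(self):
        self._blocks = {}  # prefix-hash -> placeholder for real KV data

    def _block_keys(self, tokens):
        keys = []
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            # Key each block by the entire prefix so that a hash match
            # implies an identical attention context for that block.
            keys.append(hash(tuple(tokens[:start + BLOCK_SIZE])))
        return keys

    def match_and_insert(self, tokens):
        """Return (reused_blocks, computed_blocks) for this request."""
        reused = computed = 0
        for key in self._block_keys(tokens):
            if key in self._blocks:
                reused += 1
            else:
                self._blocks[key] = object()  # stand-in for real KV tensors
                computed += 1
        return reused, computed

cache = BlockKVCache()
print(cache.match_and_insert([1, 2, 3, 4, 5, 6, 7, 8]))     # (0, 2): cold cache
print(cache.match_and_insert([1, 2, 3, 4, 9, 10, 11, 12]))  # (1, 1): first block reused
```

The second request shares its first four-token block with the first request, so only one new block's KV values would need to be computed.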
# PR title
Please write the PR title following this template: [JIRA ticket link/nvbug link/github issue link][fix/feat/doc/infra/...]
For example, assume I have a PR hoping to support a new...
## Description
End-to-end support for the Ngram drafter in the PyTorch workflow (previously named Prompt-Lookup-Decoding (PLD) in the TRT workflow). Usage example:

```bash
python examples/pytorch/quickstart_advanced.py \
    --spec_decode_algo NGRAM \
    --spec_decode_nextn 4 \
    --max_matching_ngram_size...
```
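The core idea behind n-gram prompt-lookup drafting can be sketched in a few lines (this is a generic illustration of the technique, not the PR's code; `ngram_draft` and its parameters are invented names): search the existing token sequence for the most recent earlier occurrence of its trailing n-gram, and propose the tokens that followed that occurrence as draft tokens for the target model to verify.

```python
# Generic sketch of n-gram prompt-lookup drafting (invented helper,
# not TensorRT-LLM's Ngram drafter implementation).

def ngram_draft(tokens, max_ngram=3, num_draft=4):
    """Return up to `num_draft` speculative tokens, or [] if no match."""
    # Try the longest trailing n-gram first, then shorter ones.
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        tail = tokens[-n:]
        # Scan earlier positions, newest first, for the same n-gram.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == tail:
                follow = tokens[start + n:start + n + num_draft]
                if follow:
                    return follow
    return []

toks = ["the", "cat", "sat", "on", "the"]
print(ngram_draft(toks))  # ['cat', 'sat', 'on', 'the']
```

Here the trailing "the" matches the sequence's first token, so the four tokens that followed it are proposed as the draft, which the target model then accepts or rejects in a single verification step.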
This PR, in conjunction with [PR 3769](https://github.com/NVIDIA/TensorRT-LLM/pull/3769), provides an interface for dynamically linking NIXL.
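As a generic illustration of what runtime dynamic linking looks like (NIXL's actual interface and symbol names are not part of this excerpt, so the standard math library `libm` is used as a stand-in): the library path is resolved at runtime rather than at build time, and symbols are looked up and typed before use.

```python
# Generic runtime dynamic-linking illustration via ctypes; libm stands in
# for the dynamically linked library, since NIXL's symbols are not shown here.
import ctypes
import ctypes.util

# Resolve the shared library at runtime instead of link time.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the signature of the symbol we want to call.
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))  # 1.0
```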
Previous PR: https://github.com/NVIDIA/TensorRT-LLM/pull/3851
Previous revert PR: https://github.com/NVIDIA/TensorRT-LLM/pull/4340
# Remove vila test from backend tests
# Add llama4 disagg accuracy test
`[05/14/2025-17:25:09] [TRT-LLM] [I] MMLU weighted average accuracy: 80.38 (4104)`
# Support MCP in TensorRT-LLM Scaffolding
Issue: [#3335](https://github.com/NVIDIA/TensorRT-LLM/issues/3335)
## Description
MCP provides a standard tool-use capability to TensorRT-LLM, making LLM function calls usable through a common protocol.
## Examples
Run an MCP server...
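For readers unfamiliar with MCP, a sketch of the message shape involved may help (this reflects my understanding of the MCP specification's JSON-RPC `tools/call` method; the Scaffolding client code in the actual PR is not shown here, and `make_tool_call` and the tool name are invented for illustration):

```python
# Sketch of an MCP tool-invocation request (JSON-RPC 2.0, `tools/call`),
# per the MCP spec as understood here; helper and tool names are invented.
import json

def make_tool_call(request_id, tool_name, arguments):
    """Build an MCP `tools/call` JSON-RPC request body."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

msg = make_tool_call(1, "web_search", {"query": "TensorRT-LLM"})
print(json.dumps(msg))
```

A scaffolding layer would serialize such a request to an MCP server, then feed the tool's result back into the LLM's context to complete the function call.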
## Description
There were two bugs in the tests:
- `moe_backend` was not passed to `PyTorchConfig`
- `batch_size` should have been `max_batch_size`

After fixing these, I reran the tests and...
# [TRTLLM-5273][feat] Use full attention mask if Llama3 is used as encoder, and fix EarlyStopDecoder unsqueeze bug
## Description
This PR adds a `bidirectional_attention` flag to `modeling_llama.py`. This is...
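To illustrate what such a flag changes (an assumed sketch of the general concept, not the PR's code; `attention_mask` and its signature are invented): a decoder uses a causal lower-triangular mask, while encoder-style bidirectional attention lets every token attend to every other token.

```python
# Illustrative causal vs. full (bidirectional) attention masks;
# invented helper, not the modeling_llama.py implementation.

def attention_mask(seq_len, bidirectional):
    """1 = position may be attended to, 0 = masked out."""
    if bidirectional:
        # Encoder use: every token attends to every position.
        return [[1] * seq_len for _ in range(seq_len)]
    # Decoder use: token i attends only to positions j <= i.
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

print(attention_mask(3, bidirectional=False))  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(attention_mask(3, bidirectional=True))   # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```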