
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently...

Results: 937 TensorRT-LLM issues, sorted by recently updated.

# [feat] Enable KV cache to be reused during request generation

Issue: [NVIDIA/TensorRT-LLM#3733](https://github.com/NVIDIA/TensorRT-LLM/issues/3733)

## Description

This PR enhances the KV cache reuse logic...
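The general idea behind KV cache reuse can be sketched with a toy block cache (this is an illustrative sketch, not the PR's implementation; `BLOCK_SIZE` and all names are hypothetical): each fixed-size block of tokens is keyed on its contents plus the id of its parent block, so requests that share a prefix resolve to the same chain of cached blocks instead of recomputing them.

```python
BLOCK_SIZE = 4  # tokens per KV cache block (hypothetical value)

class BlockCache:
    """Toy prefix-block cache: shared prefixes map to shared block ids."""

    def __init__(self):
        self._blocks = {}   # (parent_block_id, token_tuple) -> block id
        self._next_id = 0
        self.hits = 0
        self.misses = 0

    def lookup_or_insert(self, tokens):
        """Map a token sequence to block ids, reusing any full block seen before."""
        block_ids, parent = [], None
        # Only complete blocks are cacheable; the partial tail is recomputed.
        for start in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            key = (parent, tuple(tokens[start:start + BLOCK_SIZE]))
            if key in self._blocks:
                self.hits += 1
            else:
                self.misses += 1
                self._blocks[key] = self._next_id
                self._next_id += 1
            parent = self._blocks[key]
            block_ids.append(parent)
        return block_ids

cache = BlockCache()
a = cache.lookup_or_insert([1, 2, 3, 4, 5, 6, 7, 8])     # cold: all misses
b = cache.lookup_or_insert([1, 2, 3, 4, 5, 6, 7, 8, 9])  # shared prefix: all hits
print(a, b, cache.hits, cache.misses)
```

Keying each block on its parent's id (rather than on the tokens alone) ensures a block is only reused when the entire preceding context matches, which is what makes the cached KV values valid.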

Labels: Community want to contribute, Community Engagement

# PR title

Please write the PR title following this template: `[JIRA ticket link/nvbug link/github issue link][fix/feat/doc/infra/...]` \ For example, assume I have a PR that hopes to support a new...

## Description

End-to-end support for the NGram drafter in the PyTorch workflow (previously named Prompt-Lookup Decoding (PLD) in the TRT workflow). Usage example:

```bash
python examples/pytorch/quickstart_advanced.py \
    --spec_decode_algo NGRAM \
    --spec_decode_nextn 4 \
    --max_matching_ngram_size...
```

This PR, in conjunction with [PR 3769](https://github.com/NVIDIA/TensorRT-LLM/pull/3769), provides an interface solution for dynamically linking NIXL.

Last PR: https://github.com/NVIDIA/TensorRT-LLM/pull/3851; last revert PR: https://github.com/NVIDIA/TensorRT-LLM/pull/4340

# Remove vila test from backend tests

Please write the PR title following this template: `[JIRA ticket link/nvbug link/github issue link][fix/feat/doc/infra/...]` \ For example, assume I have a PR that hopes...

# Add llama4 disagg accuracy test

`[05/14/2025-17:25:09] [TRT-LLM] [I] MMLU weighted average accuracy: 80.38 (4104)`

Please write the PR title following this template: `[JIRA ticket link/nvbug link/github issue link][fix/feat/doc/infra/...]` \...

# Support MCP in TensorRT-LLM Scaffolding

Supports MCP (#3335): https://github.com/NVIDIA/TensorRT-LLM/issues/3335

## Description

MCP provides a standard tool-use capability to TensorRT-LLM, letting the LLM make function calls.

## Examples

Run an MCP server...


## Description

There were two bugs in the tests:

- `moe_backend` was not passed to `PyTorchConfig`
- `batch_size` should have been `max_batch_size`

After fixing these, I reran the tests and...
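The two fixes can be sketched with a stand-in config object (the field names below are illustrative stand-ins for TensorRT-LLM's real `PyTorchConfig`, not its actual definition):

```python
from dataclasses import dataclass

@dataclass
class PyTorchConfig:
    """Stand-in for TensorRT-LLM's PyTorchConfig (hypothetical fields)."""
    moe_backend: str = "CUTLASS"
    max_batch_size: int = 8

def make_config_fixed(moe_backend: str, max_batch_size: int) -> PyTorchConfig:
    # Fix 1: forward moe_backend instead of silently dropping it.
    # Fix 2: use the keyword max_batch_size; passing batch_size=... would
    #        raise TypeError because no such field exists.
    return PyTorchConfig(moe_backend=moe_backend, max_batch_size=max_batch_size)

cfg = make_config_fixed("TRTLLM", 32)
print(cfg.moe_backend, cfg.max_batch_size)
```

With the buggy version, the config would keep its default backend and the wrong keyword would fail at construction time, which is why the tests only passed after both changes.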

# [TRTLLM-5273][feat] Use full attention mask if Llama3 is used as encoder, and fix EarlyStopDecoder unsqueeze bug

## Description

This PR adds a `bidirectional_attention` flag to `modeling_llama.py`. This is...