tensorrtllm_backend
The Triton TensorRT-LLM Backend
Hi, my app requires streaming since I want to stop the generation once a certain (complicated) condition is met. My decoding method is beam_search with beam_width=2, using greedy decoding or...
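The use case above can be sketched generically: consume a token stream and break out once a custom predicate on the accumulated text fires. This is a minimal sketch only; the plain iterator stands in for Triton's decoupled streaming responses, and the names `token_stream` and `should_stop` are illustrative, not Triton or TensorRT-LLM API.

```python
def should_stop(text: str) -> bool:
    # Placeholder for the "complicated" condition from the question.
    return "END" in text

def consume_stream(token_stream):
    """Accumulate streamed tokens, stopping early when the condition is met."""
    pieces = []
    for token in token_stream:
        pieces.append(token)
        if should_stop("".join(pieces)):
            break  # with a real streaming client, cancel the request here
    return "".join(pieces)

print(consume_stream(iter(["Hello", " ", "world", " END", " ignored"])))
# → Hello world END
```

With a real decoupled client the `break` would be paired with a request-cancellation call so the server stops generating, rather than just discarding further tokens client-side.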
Function Decomposition: The argument parsing logic was moved to a separate function parse_args() to improve readability and maintainability. This function encapsulates the logic related to parsing command-line arguments. Input Validation:...
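The refactor described above can be illustrated with a small sketch: argument parsing pulled into its own `parse_args()` function with basic input validation. The flag names here are assumptions chosen for illustration, not the repository's actual options.

```python
import argparse

def parse_args(argv=None):
    """Encapsulates command-line parsing and validation in one place."""
    parser = argparse.ArgumentParser(description="Example launcher")
    parser.add_argument("--model-dir", required=True, help="Path to the model")
    parser.add_argument("--beam-width", type=int, default=1)
    args = parser.parse_args(argv)
    # Input validation: reject values the rest of the program cannot handle.
    if args.beam_width < 1:
        parser.error("--beam-width must be >= 1")
    return args

args = parse_args(["--model-dir", "/models/llm", "--beam-width", "2"])
print(args.beam_width)  # → 2
```

Keeping parsing in one function also makes it testable: the `argv` parameter lets unit tests pass argument lists directly instead of patching `sys.argv`.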
### System Info - 8*A800 80G ### Who can help? @kaiyux ### Information - [X] The official example scripts - [ ] My own modified scripts ### Tasks - [X]...
### System Info nvidia-rtx-a100 ### Who can help? _No response_ ### Information - [X] The official example scripts - [ ] My own modified scripts ### Tasks - [ ]...
I would like to send Lora weights through to a compiled tensor rt llm model but am unsure how to load the .bin weights and pass them to Triton. An...
### System Info - CPU architecture: x86_64 - CPU/Host memory size: 1T - GPU name: NVIDIA A100-40G - TensorRT-LLM branch: main, v0.9.0, 118b3d7 - CUDA: 12.3 - NVIDIA driver: 545.23.08...
https://github.com/triton-inference-server/tensorrtllm_backend/tree/v0.8.0 The README for this Triton server version has many references to the `23.10` version of Triton, which, based on the [support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/), I believe does **not** support v0.8.0. v0.8.0...
Hi, thank you for the great work you're doing on TensorRT-LLM and the Triton backend. I have some questions on matching versions between the tensorrt-llm Python package, the backend, and...
In TensorRT-LLM, it is possible to integrate a LogitsProcessor during model inference to control the behavior of the inference process. Is it feasible to add a similar interface in the...
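A framework-agnostic sketch of what a LogitsProcessor-style hook does: a callable that receives the current logits and returns a modified copy before the sampler runs. The class name and signature below are illustrative only and do not mirror TensorRT-LLM's or the backend's actual interface.

```python
class BanTokenProcessor:
    """Hypothetical logits hook that forbids one token id from being sampled."""

    def __init__(self, banned_token_id: int):
        self.banned_token_id = banned_token_id

    def __call__(self, logits):
        # Return a modified copy; -inf makes the token unselectable
        # under both greedy decoding and softmax sampling.
        out = list(logits)
        out[self.banned_token_id] = float("-inf")
        return out

proc = BanTokenProcessor(banned_token_id=1)
new_logits = proc([0.1, 9.9, 0.3])
print(new_logits.index(max(new_logits)))  # greedy pick avoids token 1 → 2
```

In a real runtime this hook would be invoked once per decoding step, per sequence, which is why such interfaces are usually passed in at request time rather than baked into the engine.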
### System Info I have searched the repo here and the main server repo but don't see any information on either a) support for Safetensors (many models are saved that...