tensorrtllm_backend
The Triton TensorRT-LLM Backend
TRT-LLM version: **0.5.0**
Triton server version: **23.10**
GPU type: A100, 80 GB, with MIG enabled (20 GB GPU memory per split, 3 splits per node).
I am trying to run a Falcon-7B...
Hi folks. I am currently testing on the AWS EC2 G5g series (AWS Graviton2, ARM64). Here is my error output.
```
I1122 15:20:46.453111 168 libtorch.cc:2507] TRITONBACKEND_Initialize: pytorch
I1122 15:20:46.453159 168...
```
Is there a plan to add support for block reuse in beam search? It could be very helpful. When I try to use it, I get the exception: Block reuse...
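For reference, block reuse is normally switched on through the `enable_kv_cache_reuse` parameter in the tensorrt_llm model's config.pbtxt; a minimal sketch (the value is illustrative, and in recent versions the engine must also be built with paged context FMHA for reuse to take effect):
```
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "true"
  }
}
```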
### System Info
Intel(R) Xeon(R) CPU @ 2.20GHz
Architecture: x86_64
NVIDIA A100-SXM4-40G
Ubuntu

### Who can help?
_No response_

### Information
- [X] The official example scripts
- [X] My...
Trying to use perf_analyzer as follows, deploying LLaMA2-13B with Triton:
```
python scripts/launch_triton_server.py --world_size 2 --model_repo triton_model_repo
perf_analyzer -m ensemble -i grpc --shape "bad_words:1" --shape "max_tokens:1" --shape "stop_words:1" --shape "text_input:1" --streaming...
```
### System Info
arch - x86-64
gpu - rtx3070
docker image - nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
tensorRT-LLM-backend tag - 0.7.2
tensorRT-LLM tag - 0.7.1 (80bc07510ac4ddf13c0d76ad295cdb2b75614618)

### Who can help?
@juney-nvidia

### Information
- [X]...
I see that the Triton backend creates an [object of GptManager](https://github.com/triton-inference-server/tensorrtllm_backend/blob/bf5e9007a3f16c7fc76cb156a3362d1caae198dd/inflight_batcher_llm/src/model_instance_state.cc#L388), which gets passed the engine directory. However, I am unable to see any code that shows how this GptManager is being...
We have implemented a custom postprocessing step in beam search decoding where we filter some outputs out of the final beam output. In a case where we are left with...
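For anyone reproducing this, such filtering typically lives in the postprocessing model of the ensemble. Below is a minimal sketch of a Triton Python-backend model that drops beams matching a predicate; the tensor names, the banned-substring filter, and the padding fallback are illustrative assumptions, not the backend's actual contract:
```python
# Sketch of beam filtering in a Triton Python-backend postprocessing model.
# Tensor names and the filter predicate are hypothetical.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Assumed input: decoded beam texts, shape [batch, beam_width], BYTES.
            beams = pb_utils.get_input_tensor_by_name(
                request, "OUTPUT_TEXT").as_numpy()

            filtered = []
            for row in beams:
                # Hypothetical filter: drop beams containing a banned substring.
                kept = [b for b in row if b"banned" not in b]
                if not kept:            # never return an empty beam set;
                    kept = [row[0]]     # fall back to the top beam
                # Pad so the output tensor stays rectangular.
                kept += [b""] * (len(row) - len(kept))
                filtered.append(kept)

            out = pb_utils.Tensor("FILTERED_TEXT",
                                  np.array(filtered, dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```
Padding keeps the output shape fixed at beam_width, so downstream ensemble steps see a consistent tensor shape regardless of how many beams survive the filter.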
When unexpected large bursts of requests hit my application, I would like to be able to limit the number of requests that the trtllm backend will accept. I...
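On the server side, Triton core's dynamic-batching queue policy (`default_queue_policy { max_queue_size: ... }` with `timeout_action: REJECT`) can reject excess queued requests for models scheduled by Triton's own batcher, though the TRT-LLM backend's in-flight batcher may not go through that path. As a stopgap, load can also be shed before it reaches the server; a minimal client-side sketch, where the endpoint, model, and tensor names are assumptions:
```python
# Client-side load shedding sketch: cap concurrent in-flight requests and
# reject the overflow instead of queueing it.
import threading
import numpy as np
import tritonclient.grpc as grpcclient

MAX_IN_FLIGHT = 32                      # illustrative cap
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)
client = grpcclient.InferenceServerClient("localhost:8001")  # assumed endpoint

def infer(prompt: str):
    # Refuse immediately when the cap is reached rather than blocking.
    if not slots.acquire(blocking=False):
        raise RuntimeError("overloaded: request shed")
    try:
        text = np.array([[prompt.encode()]], dtype=object)
        inp = grpcclient.InferInput("text_input", list(text.shape), "BYTES")
        inp.set_data_from_numpy(text)
        return client.infer("ensemble", [inp])   # assumed model name
    finally:
        slots.release()
```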
This PR adds a speculative decoding example.