tensorrtllm_backend

The Triton TensorRT-LLM Backend

Results: 251 tensorrtllm_backend issues

I have a Mistral-7B model with fine-tuned LoRA weights in bfloat16. I ran into issues when attempting to use my adapters, which were compiled for bfloat16. Running the following...

bug
triaged

I've followed the instructions at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/baichuan.md to run Baichuan2-7b-Chat. But for exactly the same engine, the outputs are always different between running `curl -X POST localhost:8000/v2/models/ensemble/generate` and `python /tensorrtllm_backend/tensorrt_llm/examples/run.py`. Somehow the...
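For reference, a minimal sketch of calling the same generate endpoint from Python with sampling pinned to greedy decoding, which is one way to check whether decoding randomness explains a difference against `run.py`. The `text_input`, `max_tokens`, and `text_output` field names follow the backend's generate-endpoint examples; the prompt is illustrative, and `top_k`/`temperature` are assumptions that only take effect if they are exposed in the ensemble's config.pbtxt:

```python
import requests

# Illustrative request to the ensemble generate endpoint on a local Triton server.
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
    "bad_words": "",
    "stop_words": "",
    "top_k": 1,          # assumption: forces greedy token selection if exposed
    "temperature": 1.0,  # assumption: neutral temperature, irrelevant with top_k=1
}
resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
resp.raise_for_status()
print(resp.json()["text_output"])
```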

### System Info
Intel(R) Xeon(R) CPU @ 2.20GHz, Architecture: x86_64, NVIDIA A100-SXM4-40G, Ubuntu
### Who can help?
@kaiyux
### Information
- [X] The official example scripts
- [X] My own...

bug

### System Info
- CPU architecture: `x86_64`
- GPU: NVIDIA A10 24GB
- TensorRT-LLM: `v0.8.0` (docker build via `make -C docker release_build CUDA_ARCHS="86-real"`)
- Triton Inference Server: `r24.02` (docker from...

bug

### System Info
x86_64, V100, Triton server image: nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3, tensorrtllm_backend: v0.7.1
### Who can help?
_No response_
### Information
- [X] The official example scripts
- [ ] My own...

bug
triaged

trtllm crashes when I send long-context requests that are within the `max-input-length` limit. I believe it happens when the total pending requests reach the `max-num-tokens` limit. But why is it not queuing requests...

### System Info
- DGX H100
- TensorRT-LLM 0.7.1
### Who can help?
_No response_
### Information
- [X] The official example scripts
- [ ] My own modified scripts...

bug
triaged

When a TensorRT-LLM model is deployed in streaming mode, tokens have no whitespace between streaming chunks. [Link to Issue](https://github.com/triton-inference-server/tensorrtllm_backend/issues/332#issuecomment-2063243340) This is because calling tokenizer.decode does a whitespace strip...
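A common client-side workaround for this kind of issue is to detokenize cumulatively instead of chunk by chunk, emitting only the new text each time so leading spaces survive chunk boundaries. A minimal sketch, assuming a Hugging Face tokenizer; the checkpoint name is illustrative and `token_id_chunks` is a hypothetical stand-in for however your client receives per-chunk token IDs from Triton:

```python
from transformers import AutoTokenizer

# Illustrative tokenizer; substitute whatever matches your deployed model.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def stream_text(token_id_chunks):
    """Yield text deltas by decoding the cumulative token sequence each time,
    so whitespace between streaming chunks is preserved."""
    all_ids = []
    emitted = ""
    for chunk in token_id_chunks:
        all_ids.extend(chunk)
        decoded = tokenizer.decode(all_ids, skip_special_tokens=True)
        delta = decoded[len(emitted):]
        emitted = decoded
        if delta:
            yield delta
```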

**Description**
Trying to deploy Mistral-7B with Triton + TensorRT-LLM and running into this issue.
**Triton Information**
Are you using the Triton container or did you build it yourself? nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
**To Reproduce**
Steps...

triaged