tensorrtllm_backend

The Triton TensorRT-LLM Backend

Results: 251 tensorrtllm_backend issues

I have a Mistral-7B model with fine-tuned LoRA weights in bfloat16. I ran into issues when attempting to use my adapters, which were compiled for bfloat16. Running the following...

bug
triaged

I've followed the instructions at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/baichuan.md to run Baichuan2-7b-Chat. But for exactly the same engine, the outputs are always different between running `curl -X POST localhost:8000/v2/models/ensemble/generate` and `python /tensorrtllm_backend/tensorrt_llm/examples/run.py`. Somehow the...
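For reference, a minimal sketch of calling the same generate endpoint from Python with sampling pinned to greedy decoding, which is one way to check whether decoding randomness explains a difference against `run.py`. The `text_input`, `max_tokens`, and `text_output` field names follow the backend's generate-endpoint examples; the prompt is illustrative, and `top_k`/`temperature` are assumptions that only take effect if they are exposed in the ensemble's config.pbtxt:

```python
import requests

# Illustrative request to the ensemble generate endpoint on a local Triton server.
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
    "bad_words": "",
    "stop_words": "",
    "top_k": 1,          # assumption: forces greedy token selection if exposed
    "temperature": 1.0,  # assumption: neutral temperature, irrelevant with top_k=1
}
resp = requests.post("http://localhost:8000/v2/models/ensemble/generate", json=payload)
resp.raise_for_status()
print(resp.json()["text_output"])
```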

### System Info
Intel(R) Xeon(R) CPU @ 2.20GHz, Architecture: x86_64, NVIDIA A100-SXM4-40G, Ubuntu
### Who can help?
@kaiyux
### Information
- [X] The official example scripts
- [X] My own...

bug

### System Info
- CPU architecture: `x86_64`
- GPU: NVIDIA A10 24GB
- TensorRT-LLM: `v0.8.0` (docker build via `make -C docker release_build CUDA_ARCHS="86-real"`)
- Triton Inference Server: `r24.02` (docker from...

bug

### System Info
x86_64, V100, Triton server image: nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3, tensorrtllm_backend: v0.7.1
### Who can help?
_No response_
### Information
- [X] The official example scripts
- [ ] My own...

bug
triaged

trtllm crashes when I send long-context requests that are within the `max-input-length` limit. I believe it happens when the total pending requests reach the `max-num-tokens` limit. But why is it not queuing requests...

### System Info
- DGX H100
- TensorRT-LLM 0.7.1
### Who can help?
_No response_
### Information
- [X] The official example scripts
- [ ] My own modified scripts...

bug
triaged

When a TensorRT-LLM model is deployed in streaming mode, tokens have no whitespace between streaming chunks. [Link to Issue](https://github.com/triton-inference-server/tensorrtllm_backend/issues/332#issuecomment-2063243340) This is because calling tokenizer.decode does a whitespace strip...
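A common client-side workaround for this kind of issue is to detokenize cumulatively instead of chunk by chunk, emitting only the new text each time so leading spaces survive chunk boundaries. A minimal sketch, assuming a Hugging Face tokenizer; the checkpoint name is illustrative and `token_id_chunks` is a hypothetical stand-in for however your client receives per-chunk token IDs from Triton:

```python
from transformers import AutoTokenizer

# Illustrative tokenizer; substitute whatever matches your deployed model.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def stream_text(token_id_chunks):
    """Yield text deltas by decoding the cumulative token sequence each time,
    so whitespace between streaming chunks is preserved."""
    all_ids = []
    emitted = ""
    for chunk in token_id_chunks:
        all_ids.extend(chunk)
        decoded = tokenizer.decode(all_ids, skip_special_tokens=True)
        delta = decoded[len(emitted):]
        emitted = decoded
        if delta:
            yield delta
```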

**Description**
Trying to deploy Mistral-7B with Triton + TensorRT-LLM and running into this issue.
**Triton Information**
Are you using the Triton container or did you build it yourself? nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
**To Reproduce**
Steps...

triaged