tensorrtllm_backend
The Triton TensorRT-LLM Backend
TRT-LLM version: **0.5.0**
Triton server version: **23.10**
GPU type: A100, 80 GB, with MIG enabled (20 GB GPU memory per split, 3 splits per node).
I am trying to run a Falcon-7B...
Hi folks. I am currently testing on the AWS EC2 G5g series (AWS Graviton2, ARM64). Here is my error output.
```
I1122 15:20:46.453111 168 libtorch.cc:2507] TRITONBACKEND_Initialize: pytorch
I1122 15:20:46.453159 168...
```
Is there a plan to add support for block reuse in beam search? It could be very helpful. When I try to use it, I get the exception: Block reuse...
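For reference, block reuse is normally switched on through the `enable_kv_cache_reuse` parameter in the tensorrt_llm model's config.pbtxt; a minimal sketch (the value is illustrative, and in recent versions the engine must also be built with paged context FMHA for reuse to take effect):
```
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "true"
  }
}
```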
### System Info
Intel(R) Xeon(R) CPU @ 2.20GHz
Architecture: x86_64
NVIDIA A100-SXM4-40G
Ubuntu

### Who can help?
_No response_

### Information
- [X] The official example scripts
- [X] My...
Trying to use perf_analyzer as follows, deploying LLaMA2-13B with Triton:
```
python scripts/launch_triton_server.py --world_size 2 --model_repo triton_model_repo
perf_analyzer -m ensemble -i grpc --shape "bad_words:1" --shape "max_tokens:1" --shape "stop_words:1" --shape "text_input:1" --streaming...
```
### System Info
arch - x86-64
gpu - rtx3070
docker image - nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
tensorRT-LLM-backend tag - 0.7.2
tensorRT-LLM tag - 0.7.1 (80bc07510ac4ddf13c0d76ad295cdb2b75614618)

### Who can help?
@juney-nvidia

### Information
- [X]...
I see that the Triton backend creates an [object of GptManager](https://github.com/triton-inference-server/tensorrtllm_backend/blob/bf5e9007a3f16c7fc76cb156a3362d1caae198dd/inflight_batcher_llm/src/model_instance_state.cc#L388), which gets passed the engine directory. However, I am unable to see any code that shows how this GptManager is being...
We have implemented a custom postprocessing step in beam search decoding where we filter some outputs out of the final beam output. In a case where we are left with...
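For anyone reproducing this, such filtering typically lives in the postprocessing model of the ensemble. Below is a minimal sketch of a Triton Python-backend model that drops beams matching a predicate; the tensor names, the banned-substring filter, and the padding fallback are illustrative assumptions, not the backend's actual contract:
```python
# Sketch of beam filtering in a Triton Python-backend postprocessing model.
# Tensor names and the filter predicate are hypothetical.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Assumed input: decoded beam texts, shape [batch, beam_width], BYTES.
            beams = pb_utils.get_input_tensor_by_name(
                request, "OUTPUT_TEXT").as_numpy()

            filtered = []
            for row in beams:
                # Hypothetical filter: drop beams containing a banned substring.
                kept = [b for b in row if b"banned" not in b]
                if not kept:            # never return an empty beam set;
                    kept = [row[0]]     # fall back to the top beam
                # Pad so the output tensor stays rectangular.
                kept += [b""] * (len(row) - len(kept))
                filtered.append(kept)

            out = pb_utils.Tensor("FILTERED_TEXT",
                                  np.array(filtered, dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```
Padding keeps the output shape fixed at beam_width, so downstream ensemble steps see a consistent tensor shape regardless of how many beams survive the filter.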
When unexpected large bursts of requests hit my application, I would like to be able to limit the number of requests that the trtllm backend will accept. I...
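On the server side, Triton core's dynamic-batching queue policy (`default_queue_policy { max_queue_size: ... }` with `timeout_action: REJECT`) can reject excess queued requests for models scheduled by Triton's own batcher, though the TRT-LLM backend's in-flight batcher may not go through that path. As a stopgap, load can also be shed before it reaches the server; a minimal client-side sketch, where the endpoint, model, and tensor names are assumptions:
```python
# Client-side load shedding sketch: cap concurrent in-flight requests and
# reject the overflow instead of queueing it.
import threading
import numpy as np
import tritonclient.grpc as grpcclient

MAX_IN_FLIGHT = 32                      # illustrative cap
slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)
client = grpcclient.InferenceServerClient("localhost:8001")  # assumed endpoint

def infer(prompt: str):
    # Refuse immediately when the cap is reached rather than blocking.
    if not slots.acquire(blocking=False):
        raise RuntimeError("overloaded: request shed")
    try:
        text = np.array([[prompt.encode()]], dtype=object)
        inp = grpcclient.InferInput("text_input", list(text.shape), "BYTES")
        inp.set_data_from_numpy(text)
        return client.infer("ensemble", [inp])   # assumed model name
    finally:
        slots.release()
```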
This PR adds a speculative decoding example.