bhsueh_NV
It is at inflight_batcher_llm/preprocessing/1/model.py
Could you try printing the related variables in https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.7.1/all_models/inflight_batcher_llm/preprocessing/1/model.py#L210?
Because we cannot reproduce your issue, we cannot provide a timeline for a fix. Also, could you try the latest main branch? There have been many updates since v0.7.1.
Could you share the full reproduction steps instead of only the scripts for launching the server? Also, please check again that you are really using the latest main branch. For example,...
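If it helps, here is a rough sketch of the kind of debug prints meant here. The tensor name `QUERY` comes from the preprocessing config, but the exact variables in scope around L210 depend on your checkout, so adapt the names to whatever is actually there:

```python
# Rough sketch only: add prints around the line in question so the inputs can
# be compared against a working setup. Variable/tensor names are illustrative.
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for idx, request in enumerate(requests):
            query = pb_utils.get_input_tensor_by_name(request, 'QUERY').as_numpy()
            # flush=True so the output shows up immediately in the Triton server log
            print(f"[preprocessing] request {idx}: QUERY={query!r} "
                  f"shape={query.shape} dtype={query.dtype}", flush=True)
            # ... keep the rest of the original execute() unchanged ...
        return responses
```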
Have you tried the `tensorrt_llm_bls` module?
Here is an example: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm#running-lora-inference-with-inflight-batching
Currently, the TRT-LLM backend does not support such a requirement.
Could you try the latest main branch?
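If you do try the `tensorrt_llm_bls` route, a minimal client-side sketch using Triton's HTTP generate endpoint looks roughly like the one below (the port, model name, and field values are assumptions, adjust them to your deployment):

```python
# Minimal sketch of a request to the tensorrt_llm_bls model via Triton's HTTP
# generate endpoint; port, model name, and parameter values are assumptions.
import requests

payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
    "bad_words": "",
    "stop_words": "",
}
resp = requests.post(
    "http://localhost:8000/v2/models/tensorrt_llm_bls/generate",
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text_output"])
```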
You need to set up some runtime parameters like `triton_max_batch_size`, `max_beam_width`, etc. (the parameters written like `${xxx}`). Here is the document: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/gemma.md#end-to-end-workflow-to-run-sp-model.
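As a rough sketch, the `${xxx}` placeholders are usually filled with the `tools/fill_template.py` script shipped in the repo; the parameter names and values below are examples only and depend on which version of `config.pbtxt` you use:

```python
# Hedged example: substitute the ${...} placeholders in config.pbtxt with
# tools/fill_template.py (-i edits the file in place). Values are examples.
import subprocess

substitutions = ",".join([
    "triton_max_batch_size:64",
    "max_beam_width:1",
    "decoupled_mode:False",
    "engine_dir:/path/to/engines",  # example path, point it at your engines
])
subprocess.run(
    [
        "python3", "tools/fill_template.py", "-i",
        "all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt",
        substitutions,
    ],
    check=True,
)
```

The preprocessing/postprocessing/ensemble configs have their own placeholders (for example `tokenizer_dir` and `triton_max_batch_size`) and need to be filled the same way.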
The TensorRT-LLM version of `nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3` is v0.7.0, so you will encounter such an issue when you build the engine with v0.7.1. I suggest using the dockerfile to build the docker image...
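A quick way to confirm which TensorRT-LLM version a container actually ships (assuming the `tensorrt_llm` wheel is importable inside it):

```python
# The 23.12 image is expected to report 0.7.0, so engines built with 0.7.1
# need an image rebuilt from the backend's dockerfile.
import tensorrt_llm
print(tensorrt_llm.__version__)
```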