bhsueh_NV

It is at `inflight_batcher_llm/preprocessing/1/model.py`.

Could you try printing the related variables in https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.7.1/all_models/inflight_batcher_llm/preprocessing/1/model.py#L210?
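
For reference, a minimal sketch of what such debug printing could look like, to be called at the top of `TritonPythonModel.execute()` in that `model.py`. The input names `QUERY` and `REQUEST_OUTPUT_LEN` come from the preprocessing config of recent versions and may differ in yours:

```python
import triton_python_backend_utils as pb_utils


def debug_dump(requests):
    """Print the raw preprocessing inputs for each incoming request."""
    for request in requests:
        query = pb_utils.get_input_tensor_by_name(request, "QUERY").as_numpy()
        output_len = pb_utils.get_input_tensor_by_name(
            request, "REQUEST_OUTPUT_LEN").as_numpy()
        # flush=True so the output shows up immediately in the tritonserver log
        print(f"[preprocessing] QUERY={query} REQUEST_OUTPUT_LEN={output_len}",
              flush=True)
```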

Because we cannot reproduce your issue, we cannot provide a timeline for a fix. Also, could you try the latest main branch? There have been many updates since v0.7.1.

Could you share the full reproduction steps instead of only the scripts for launching the server? Also, please check again that you are really using the latest main branch. For example, ...

Here is an example: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm#running-lora-inference-with-inflight-batching
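
As a rough illustration (not the exact client used in that README), a request that should be served with a LoRA adapter could be sent through Triton's HTTP generate endpoint roughly like this. The field names (`text_input`, `max_tokens`, `lora_task_id`) follow the ensemble model's config and may differ between versions; it also assumes the LoRA weights were already registered in the server under that task id:

```python
import requests

# Hypothetical request: prompt the ensemble model and ask it to apply the
# LoRA adapter previously cached under task id 0.
payload = {
    "text_input": "What is the capital of France?",
    "max_tokens": 64,
    "lora_task_id": 0,
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate", json=payload)
print(resp.json()["text_output"])
```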

Currently, the TensorRT-LLM backend does not support such a requirement.

You need to set up some runtime parameters such as `triton_max_batch_size`, `max_beam_width`, ... (the parameters that appear as `${xxx}` placeholders in the config templates); see the sketch below. Here is the document: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/gemma.md#end-to-end-workflow-to-run-sp-model.
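
A minimal sketch of filling those placeholders with `tools/fill_template.py` before starting `tritonserver`, wrapped in Python for illustration. The parameter names and paths below follow the gemma doc and recent repo layouts and may need adjusting for your version and engine location:

```python
import subprocess

# Substitute the ${...} placeholders in the tensorrt_llm model config.
# fill_template.py takes "key:value,key:value" pairs as one argument.
params = {
    "triton_max_batch_size": 64,
    "max_beam_width": 1,
    "decoupled_mode": "False",
    "engine_dir": "/engines/gemma/1-gpu",          # assumed engine path
    "batching_strategy": "inflight_fused_batching",
}
subprocess.run(
    [
        "python3", "tools/fill_template.py", "-i",
        "all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt",
        ",".join(f"{k}:{v}" for k, v in params.items()),
    ],
    check=True,
)
```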

The TensorRT-LLM version in `nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3` is v0.7.0, so you will encounter such an issue when you build the engine with v0.7.1. I suggest using the Dockerfile to build the Docker image...
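
A quick way to confirm which TensorRT-LLM version is inside the container, assuming the `tensorrt_llm` wheel is importable there:

```python
# Run inside the tritonserver container; the printed version should match the
# one used to build the engine (0.7.0 for the 23.12 image).
import tensorrt_llm

print(tensorrt_llm.__version__)
```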