tensorrtllm_backend
Qwen2-14B generate_stream returns garbled output
Description
Streaming requests return garbled output.
Triton Information
tritonserver 24.08, running in a container from this image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
To Reproduce
Steps to reproduce the behavior:
This issue only occurs when using a streaming request to v2/models/tensorrt_llm_bls/generate_stream (the same happens with the ensemble model), with the following payload; see the full request sketch below.

```python
payload = {
    "text_input": QWEN_PROMPT_TEMPLATE.format(input_text=prompt),
    "max_tokens": max_tokens,
    "stream": True,
}
```
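For reference, here is a minimal sketch of sending and consuming such a streaming request in Python. It assumes the server listens on localhost:8000, and the `QWEN_PROMPT_TEMPLATE` and `prompt` values shown are stand-ins (the real ones come from the application); the generate_stream endpoint returns Server-Sent Events whose `data:` lines carry JSON chunks with a `text_output` field.

```python
import json
import requests

# Assumed stand-ins; the actual template and prompt come from the application.
QWEN_PROMPT_TEMPLATE = "<|im_start|>user\n{input_text}<|im_end|>\n<|im_start|>assistant\n"
prompt = "Hello, who are you?"

payload = {
    "text_input": QWEN_PROMPT_TEMPLATE.format(input_text=prompt),
    "max_tokens": 128,
    "stream": True,
}

url = "http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream"
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    # The response is a Server-Sent Events stream: each event arrives as a
    # line of the form "data: {...json...}".
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            chunk = json.loads(line[len(b"data:"):])
            print(chunk.get("text_output", ""), end="", flush=True)
print()
```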
The screenshot below shows the results of non-streaming and streaming requests.
Expected behavior
The same output as the non-streaming v2/models/tensorrt_llm_bls/generate endpoint.
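For comparison, a sketch of the equivalent non-streaming call, under the same assumptions as the streaming example above; generate returns a single JSON body with a `text_output` field instead of an event stream.

```python
# Non-streaming request to the same model; "stream" is omitted here.
resp = requests.post(
    "http://localhost:8000/v2/models/tensorrt_llm_bls/generate",
    json={
        "text_input": QWEN_PROMPT_TEMPLATE.format(input_text=prompt),
        "max_tokens": 128,
    },
)
resp.raise_for_status()
print(resp.json()["text_output"])
```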