tensorrtllm_backend
Qwen2-14B generate_stream returns garbled output
Description
Streaming requests return garbled output.
Triton Information
tritonserver 24.08, running in a container from this image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
To Reproduce
Steps to reproduce the behavior:
This issue only occurs when using a streaming request to v2/models/tensorrt_llm_bls/generate_stream (the same happens with the ensemble model), with the following payload; see the full request sketch below.

```python
payload = {
    "text_input": QWEN_PROMPT_TEMPLATE.format(input_text=prompt),
    "max_tokens": max_tokens,
    "stream": True,
}
```
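For reference, here is a minimal sketch of sending and consuming such a streaming request in Python. It assumes the server listens on localhost:8000, and the `QWEN_PROMPT_TEMPLATE` and `prompt` values shown are stand-ins (the real ones come from the application); the generate_stream endpoint returns Server-Sent Events whose `data:` lines carry JSON chunks with a `text_output` field.

```python
import json
import requests

# Assumed stand-ins; the actual template and prompt come from the application.
QWEN_PROMPT_TEMPLATE = "<|im_start|>user\n{input_text}<|im_end|>\n<|im_start|>assistant\n"
prompt = "Hello, who are you?"

payload = {
    "text_input": QWEN_PROMPT_TEMPLATE.format(input_text=prompt),
    "max_tokens": 128,
    "stream": True,
}

url = "http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream"
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    # The response is a Server-Sent Events stream: each event arrives as a
    # line of the form "data: {...json...}".
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            chunk = json.loads(line[len(b"data:"):])
            print(chunk.get("text_output", ""), end="", flush=True)
print()
```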
The screenshot below shows the results of non-streaming and streaming requests.
Expected behavior
The same output as the non-streaming v2/models/tensorrt_llm_bls/generate endpoint.
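For comparison, a sketch of the equivalent non-streaming call, under the same assumptions as the streaming example above; generate returns a single JSON body with a `text_output` field instead of an event stream.

```python
# Non-streaming request to the same model; "stream" is omitted here.
resp = requests.post(
    "http://localhost:8000/v2/models/tensorrt_llm_bls/generate",
    json={
        "text_input": QWEN_PROMPT_TEMPLATE.format(input_text=prompt),
        "max_tokens": 128,
    },
)
resp.raise_for_status()
print(resp.json()["text_output"])
```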