
[Bug]: frequency_penalty Parameter Not Working in TensorRT-LLM RC1.1.0rc5

Open 0xd8b opened this issue 1 month ago • 0 comments

System Info

TensorRT-LLM Version: RC1.1.0rc5

Model: Qwen/Qwen3-14B (same issue occurs with other models)

Command:

trtllm-serve /data/Qwen3-14B/ --port 8000 --host 0.0.0.0 --kv_cache_free_gpu_memory_fraction 0.9 --extra_llm_api_options default_config.yaml

default_config.yaml:

enable_iter_req_stats: True
return_perf_metrics: True
enable_chunked_prefill: True
enable_iter_perf_stats: True
guided_decoding_backend: xgrammar

API Client: OpenAI Python Client

Base Image: TensorRT-LLM_rc1.1.0rc5

GPU: H20

Who can help?

@juney-nvidia @Tracin @laikhtewari

Information

  • [ ] The official example scripts
  • [x] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [x] My own task or dataset (give details below)

Reproduction

1. Run the provided code example below.

2. Set frequency_penalty = 2.0.

3. Observe repeated vocabulary in the model output.

4. Compare outputs with different frequency_penalty values (0, 1.0, 2.0) and notice no significant differences.

import openai
import httpx

client = openai.OpenAI(
    base_url="http://localhost:9823/v1",
    api_key="",
    http_client=httpx.Client(verify=False)
)

response = client.chat.completions.create(
    model="Qwen3-14B",
    messages=[
        {"role": "system", "content": "Translate from English into Ukrainian."},
        {"role": "user", "content": "<p>As per Bijié Wǎng, Bitcoin price continues to face downward pressure...</p>"}
    ],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False}
    },
    frequency_penalty=2.0,  # ⚠️ This parameter is not working
    stream_options={"include_usage": False},
    temperature=0,
    top_p=1,
    stream=True
)
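To make the comparison across penalty values objective rather than eyeballed, a simple repetition metric can be computed over each output. The helper below is an illustrative sketch, not part of the original repro: it reports the ratio of distinct whitespace-delimited words to total words, which should increase with frequency_penalty if the parameter is actually applied.

```python
from collections import Counter

def distinct_ratio(text: str) -> float:
    """Ratio of unique whitespace-delimited tokens to total tokens (1.0 = no repeats)."""
    words = text.split()
    if not words:
        return 0.0
    return len(Counter(words)) / len(words)

# A highly repetitive string scores low (3 unique / 9 total here);
# a fully diverse one scores 1.0.
print(distinct_ratio("the price fell the price fell the price fell"))
print(distinct_ratio("bitcoin price faces renewed downward pressure today"))
```

Running this over the translations produced at penalty 0, 1.0, and 2.0 would show whether the scores are statistically flat, as the bug report claims.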

Expected behavior

Setting frequency_penalty=2.0 should significantly reduce repeated vocabulary

Higher penalty values should prevent the model from reusing already appeared tokens

Vocabulary diversity in output text should be noticeably improved
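For reference, the OpenAI API documents frequency_penalty as a per-token logit adjustment proportional to how many times the token has already been generated. The sketch below illustrates that definition in isolation; it is not TensorRT-LLM's implementation, and the token strings and logit values are made up for the example.

```python
from collections import Counter

def apply_frequency_penalty(logits: dict[str, float],
                            generated: list[str],
                            penalty: float) -> dict[str, float]:
    """Subtract penalty * count(token) from each candidate token's logit,
    following the OpenAI frequency_penalty definition."""
    counts = Counter(generated)
    return {tok: logit - penalty * counts[tok] for tok, logit in logits.items()}

logits = {"bitcoin": 5.0, "price": 4.0, "pressure": 3.0}
generated = ["bitcoin", "bitcoin", "price"]
# "bitcoin" appeared twice, so its logit drops by 2.0 * 2 = 4.0 (5.0 -> 1.0),
# making a third repeat far less likely.
print(apply_frequency_penalty(logits, generated, penalty=2.0))
```

With penalty=2.0 (the maximum the API allows), the effect on repeated tokens should be large and easy to observe, which is why identical outputs across penalty values point to the parameter being dropped.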

Actual behavior

Output results remain largely identical regardless of frequency_penalty value (0, 1.0, 2.0)

Repeated vocabulary continues to appear frequently

Parameter adjustments have no noticeable impact on output quality

Additional notes

The issue persists in both streaming and non-streaming modes

The same code works properly with other inference frameworks (e.g., vLLM, SGLang)

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

0xd8b · Nov 21 '25 15:11