[Bug]: frequency_penalty Parameter Not Working in TensorRT-LLM RC1.1.0rc5
System Info
TensorRT-LLM Version: RC1.1.0rc5
Model: Qwen/Qwen3-14B (same issue occurs with other models)
Command: trtllm-serve /data/Qwen3-14B/ --port 8000 --host 0.0.0.0 --kv_cache_free_gpu_memory_fraction 0.9 --extra_llm_api_options default_config.yaml
default_config.yaml:
enable_iter_req_stats: True
return_perf_metrics: True
enable_chunked_prefill: True
enable_iter_perf_stats: True
guided_decoding_backend: xgrammar
API Client: OpenAI Python Client
Base Image: TensorRT-LLM_rc1.1.0rc5
GPU: H20
Who can help?
@juney-nvidia @Tracin @laikhtewari
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)
Reproduction
1. Start the server with the command above and run the provided code example
2. Set frequency_penalty = 2.0
3. Observe repeated vocabulary in the model output
4. Compare outputs across frequency_penalty values (0, 1.0, 2.0) and note that there are no significant differences (see the comparison sketch after the client code)
```python
import openai
import httpx

# Point the client at the trtllm-serve endpoint started above (--port 8000).
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="",  # trtllm-serve does not validate the key
    http_client=httpx.Client(verify=False),
)

response = client.chat.completions.create(
    model="Qwen3-14B",
    messages=[
        {"role": "system", "content": "Translate from English into Ukrainian."},
        {"role": "user", "content": "<p>As per Bijié Wǎng, Bitcoin price continues to face downward pressure...</p>"},
    ],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False}
    },
    frequency_penalty=2.0,  # ⚠️ This parameter is not working
    stream_options={"include_usage": False},
    temperature=0,
    top_p=1,
    stream=True,
)

# Consume the stream so the completion is actually generated and printed.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
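For step 4, here is a minimal non-streaming comparison sketch; it reuses the `client` above, with the same model name and prompt as in the repro code:

```python
# Print the full completion for each frequency_penalty value so the
# outputs can be diffed side by side. On the affected build, all three
# outputs come back essentially identical.
for fp in (0.0, 1.0, 2.0):
    result = client.chat.completions.create(
        model="Qwen3-14B",
        messages=[
            {"role": "system", "content": "Translate from English into Ukrainian."},
            {"role": "user", "content": "<p>As per Bijié Wǎng, Bitcoin price continues to face downward pressure...</p>"},
        ],
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        frequency_penalty=fp,
        temperature=0,
        top_p=1,
        stream=False,
    )
    print(f"--- frequency_penalty={fp} ---")
    print(result.choices[0].message.content)
```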
Expected behavior
Setting frequency_penalty=2.0 should noticeably reduce repeated vocabulary
Higher penalty values should increasingly discourage the model from reusing tokens that have already appeared
Vocabulary diversity in the output text should improve accordingly
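For context on what is expected: the OpenAI-style frequency penalty subtracts a term proportional to each token's count in the generated text from that token's logit before sampling. A minimal illustration of the intended adjustment (plain NumPy for illustration only, not TensorRT-LLM internals):

```python
import numpy as np

def apply_frequency_penalty(logits: np.ndarray,
                            generated_ids: list[int],
                            penalty: float) -> np.ndarray:
    """OpenAI-style frequency penalty: logit[t] -= penalty * count(t)."""
    counts = np.zeros_like(logits)
    for token_id in generated_ids:
        counts[token_id] += 1
    return logits - penalty * counts
```

With penalty=2.0, any token that has already appeared several times should have its logit pushed down substantially, which is why identical outputs across 0/1.0/2.0 suggest the parameter is being dropped somewhere.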
Actual behavior
Output results remain largely identical regardless of frequency_penalty value (0, 1.0, 2.0)
Repeated vocabulary continues to appear frequently
Parameter adjustments have no noticeable impact on output quality
Additional notes
The issue persists in both streaming and non-streaming modes
The same code works properly with other inference frameworks (e.g., vLLM, SGLang)
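To help narrow down whether the penalty is lost in the OpenAI frontend of trtllm-serve or never applied in the sampling core, here is a minimal sketch against the Python LLM API (the SamplingParams field names are my assumption from the LLM API docs; adjust if they differ in rc1.1.0rc5):

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/data/Qwen3-14B/")
prompt = "Translate from English into Ukrainian: Bitcoin price continues to face downward pressure."

for fp in (0.0, 2.0):
    # If the penalty is honored in the core, these two outputs should diverge.
    params = SamplingParams(max_tokens=256, temperature=0, top_p=1,
                            frequency_penalty=fp)
    outputs = llm.generate([prompt], params)
    print(f"--- frequency_penalty={fp} ---")
    print(outputs[0].outputs[0].text)
```

If the outputs differ here but not through the server, the bug is likely in the OpenAI endpoint's request parsing; if they are identical here too, it is in the sampler.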
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.