[Question]: Why do we need to hard-code the value of max tokens
Describe your problem
Refer to the code below.
if not llm:
    llm = TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id) if not fid else \
        TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id, llm_factory=fid)
    if not llm:
        raise LookupError("LLM(%s) not found" % dialog.llm_id)
    max_tokens = 8192
else:
    max_tokens = llm[0].max_tokens
...
if "max_tokens" in gen_conf:
    gen_conf["max_tokens"] = min(
        gen_conf["max_tokens"],
        max_tokens - used_token_count)
Can we just use max_tokens of dialog.llm_setting?
This is for the case where we can't find the max token length of the assigned LLM.
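As a sketch of that suggestion (assuming dialog.llm_setting is the dict of chat-assistant model parameters, as the surrounding service code implies), the fallback could prefer the user-configured value over the hard-coded constant:

# Sketch only: prefer the max_tokens configured on the chat assistant when the
# tenant model record gives no context length. Names follow the snippet above;
# the exact shape of dialog.llm_setting is an assumption.
if not llm:
    ...  # TenantLLMService lookup as above
    max_tokens = dialog.llm_setting.get("max_tokens", 8192)
else:
    max_tokens = llm[0].max_tokens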
Before reviewing the code, I was puzzled as to why the answers were still getting cut off even after setting a high value for max_tokens in dialog.llm_setting.
Why is there such a limit that can't be adjusted on the front end? @KevinHuSh
What kind of LLM did you use? Let me check whether there's a bug or something.
I use Xinference to add the model qwen2.5. @KevinHuSh
RAGFlow does not know the context length of models added through XInference, which needs to be improved.
I ran into a similar problem, so I even modified the max_tokens setting logic.
I changed api/db/services/dialog_service.py like this:
...
# Avoid messages fit
# used_token_count, msg = message_fit_in(msg, int(max_tokens * 0.97))
assert len(msg) >= 2, f"message_fit_in has bug: {msg}"
prompt = msg[0]["content"]
if "max_tokens" in gen_conf:
    # Do NOT consider the input message tokens
    gen_conf["max_tokens"] = min(gen_conf["max_tokens"], max_tokens)
...
And I still got a cut-off after setting a high value for max_tokens in dialog.llm_setting. I used deepseek-chat, and the OpenAI interface didn't return a stop signal. It was weird, and I wondered whether it was a frontend problem.
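A quick way to tell where the truncation happens is to look at the finish_reason on the raw completion. A minimal sketch, assuming an OpenAI-compatible endpoint; the base URL, API key and prompt are placeholders:

# Sketch: reproduce a cut-off directly against the provider and inspect finish_reason.
# base_url and api_key are placeholders for your own endpoint and credentials.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Write a detailed 1000-word overview of RAG."}],
    max_tokens=256,  # deliberately small to force a truncation
)

# "length" means the reply was cut off by max_tokens; "stop" means it ended naturally.
print(resp.choices[0].finish_reason)

If RAGFlow's clamp shrinks gen_conf["max_tokens"] before the call, the provider will report "length" even though a much larger value was set in the dialog settings.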
It's definitely controlled by the context length of the LLM.
The following piece of code has a defect: sometimes, even when the value of max_tokens is set to be greater than 8192 in the chat assistant, max_tokens still gets assigned a value of 8192.
if not llm:
    # Model name is provided by tenant, but not system built-in
    llm = TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id) if not model_provider else \
        TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id, llm_factory=model_provider)
    if not llm:
        raise LookupError("LLM(%s) not found" % dialog.llm_id)
    max_tokens = 8192
else:
    max_tokens = llm[0].max_tokens
I have modified it as follows:
max_tokens = 8192
if not llm:
    # Model name is provided by tenant, but not system built-in
    llm = TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id) if not model_provider else \
        TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id, llm_factory=model_provider)
    if not llm:
        raise LookupError("LLM(%s) not found" % dialog.llm_id)
if llm and llm[0] and hasattr(llm[0], 'max_tokens'):
    max_tokens = llm[0].max_tokens
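A slightly tighter variant of the same guard (a sketch, not the project's actual code) also protects against a record whose max_tokens is NULL or zero:

# Sketch: default to 8192 unless the tenant model record carries a usable value.
# `llm` is the TenantLLMService query result from the snippet above.
DEFAULT_MAX_TOKENS = 8192

max_tokens = DEFAULT_MAX_TOKENS
if llm:
    # getattr guards against a missing attribute; `or` guards against None/0.
    max_tokens = getattr(llm[0], "max_tokens", None) or DEFAULT_MAX_TOKENS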
It's definitely controlled by the context length of the LLM.
How do I configure this in the agent part? I'm hitting the same issue in an agent.
@KevinHuSh Have we fixed this issue?
I'm using the latest v0.17.2-slim with Xinference (deepseek-r1-qwen [max_token: 16384] and bge-m3 [max_token: 8192]). When I use my knowledge base in the "chat" and "search" tabs, the replies come back incomplete.
PS: only the answer to "hello" is returned in full, so the effective max tokens limit may be very small even though I set a large value.
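One way to narrow this down is to call the Xinference-served model directly, bypassing RAGFlow. A minimal sketch, assuming Xinference's OpenAI-compatible endpoint on the default port 9997 and that the model is registered under the name deepseek-r1-qwen (adjust host, key and model name to your deployment):

# Sketch: query the Xinference model directly to see whether truncation happens server-side.
# Host, port, api_key and model name are assumptions about the deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-r1-qwen",
    messages=[{"role": "user", "content": "Explain retrieval-augmented generation in about 800 words."}],
    max_tokens=4096,
)

# "length" would point at the model server or its context window;
# "stop" would suggest the cut-off is introduced by RAGFlow's max_tokens clamp.
print(resp.choices[0].finish_reason, len(resp.choices[0].message.content))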