[Question]: Why do we need to hard-code the value of max tokens
Describe your problem
Refer to the code below.
if not llm:
    llm = TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id) if not fid else \
        TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id, llm_factory=fid)
    if not llm:
        raise LookupError("LLM(%s) not found" % dialog.llm_id)
    max_tokens = 8192
else:
    max_tokens = llm[0].max_tokens
...
if "max_tokens" in gen_conf:
    gen_conf["max_tokens"] = min(
        gen_conf["max_tokens"],
        max_tokens - used_token_count)
Can we just use max_tokens of dialog.llm_setting?
This is for the case where we can't find the max token length of the assigned LLM.
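As a sketch of that suggestion (assuming dialog.llm_setting is the dict of chat-assistant model parameters, as the surrounding service code implies), the fallback could prefer the user-configured value over the hard-coded constant:

# Sketch only: prefer the max_tokens configured on the chat assistant when the
# tenant model record gives no context length. Names follow the snippet above;
# the exact shape of dialog.llm_setting is an assumption.
if not llm:
    ...  # TenantLLMService lookup as above
    max_tokens = dialog.llm_setting.get("max_tokens", 8192)
else:
    max_tokens = llm[0].max_tokens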
Before reviewing the code, I was puzzled as to why the answers were still getting cut off even after setting a high value for max_tokens in dialog.llm_setting.
Why is there such a limit that can't be adjusted on the front end? @KevinHuSh
What kind of LLM did you use? Let me check whether there's a bug or something.
I use Xinference to add the model qwen2.5. @KevinHuSh
RAGFlow does not know the context length of models added through XInference, which needs to be improved.
I ran into a similar problem, so I even modified the max_tokens setting logic.
I changed api/db/services/dialog_service.py like this:
...
# Avoid messages fit
# used_token_count, msg = message_fit_in(msg, int(max_tokens * 0.97))
assert len(msg) >= 2, f"message_fit_in has bug: {msg}"
prompt = msg[0]["content"]
if "max_tokens" in gen_conf:
    # Do NOT consider the input message tokens
    gen_conf["max_tokens"] = min(gen_conf["max_tokens"], max_tokens)
...
And I still got a cut-off after setting a high value for max_tokens in dialog.llm_setting. I used deepseek-chat, and the OpenAI interface didn't return a stop signal. It was weird, and I wondered whether it was a frontend problem.
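A quick way to tell where the truncation happens is to look at the finish_reason on the raw completion. A minimal sketch, assuming an OpenAI-compatible endpoint; the base URL, API key and prompt are placeholders:

# Sketch: reproduce a cut-off directly against the provider and inspect finish_reason.
# base_url and api_key are placeholders for your own endpoint and credentials.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Write a detailed 1000-word overview of RAG."}],
    max_tokens=256,  # deliberately small to force a truncation
)

# "length" means the reply was cut off by max_tokens; "stop" means it ended naturally.
print(resp.choices[0].finish_reason)

If RAGFlow's clamp shrinks gen_conf["max_tokens"] before the call, the provider will report "length" even though a much larger value was set in the dialog settings.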
It's definitely controlled by the context length of the LLM.
The following piece of code has a defect: sometimes, even when the value of max_tokens is set to be greater than 8192 in the chat assistant, max_tokens still gets assigned a value of 8192.
if not llm:
    # Model name is provided by tenant, but not system built-in
    llm = TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id) if not model_provider else \
        TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id, llm_factory=model_provider)
    if not llm:
        raise LookupError("LLM(%s) not found" % dialog.llm_id)
    max_tokens = 8192
else:
    max_tokens = llm[0].max_tokens
I have modified it as follows:
max_tokens = 8192
if not llm:
    # Model name is provided by tenant, but not system built-in
    llm = TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id) if not model_provider else \
        TenantLLMService.query(tenant_id=dialog.tenant_id, llm_name=llm_id, llm_factory=model_provider)
    if not llm:
        raise LookupError("LLM(%s) not found" % dialog.llm_id)
if llm and llm[0] and hasattr(llm[0], 'max_tokens'):
    max_tokens = llm[0].max_tokens
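A slightly tighter variant of the same guard (a sketch, not the project's actual code) also protects against a record whose max_tokens is NULL or zero:

# Sketch: default to 8192 unless the tenant model record carries a usable value.
# `llm` is the TenantLLMService query result from the snippet above.
DEFAULT_MAX_TOKENS = 8192

max_tokens = DEFAULT_MAX_TOKENS
if llm:
    # getattr guards against a missing attribute; `or` guards against None/0.
    max_tokens = getattr(llm[0], "max_tokens", None) or DEFAULT_MAX_TOKENS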
It's definitely controlled by the context length of the LLM.
How do I configure this in the agent part? I'm hitting the same issue in an agent.
@KevinHuSh Have we fixed this issue?
I'm using the latest v0.17.2-slim with Xinference (deepseek-r1-qwen [max_token: 16384] and bge-m3 [max_token: 8192]). When I use my knowledge base in the "chat" and "search" tabs, the replies come back incomplete.
PS: only the answer to "hello" is returned in full, so the effective max tokens limit may be very small even though I set a large value.
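One way to narrow this down is to call the Xinference-served model directly, bypassing RAGFlow. A minimal sketch, assuming Xinference's OpenAI-compatible endpoint on the default port 9997 and that the model is registered under the name deepseek-r1-qwen (adjust host, key and model name to your deployment):

# Sketch: query the Xinference model directly to see whether truncation happens server-side.
# Host, port, api_key and model name are assumptions about the deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deepseek-r1-qwen",
    messages=[{"role": "user", "content": "Explain retrieval-augmented generation in about 800 words."}],
    max_tokens=4096,
)

# "length" would point at the model server or its context window;
# "stop" would suggest the cut-off is introduced by RAGFlow's max_tokens clamp.
print(resp.choices[0].finish_reason, len(resp.choices[0].message.content))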