[Question]: Why is max_tokens being used for both input and output
Describe your problem
I was looking at the code of the chat() function in api/db/services/dialog_service.
I noticed that max_tokens is used to limit the size of the input sent to the LLM; that check is done in message_fit_in. But then this code runs right after message_fit_in:
if "max_tokens" in gen_conf:
gen_conf["max_tokens"] = min(
gen_conf["max_tokens"],
max_tokens - used_token_count)
And this gen_conf["max_tokens"] is later used in rag/llm/chat_model.py, inside the chat() function of the OllamaChat class:
if "max_tokens" in gen_conf: options["num_predict"] = gen_conf["max_tokens"]
This implies that max_tokens is now being used to limit the output size instead. If that is the case, why is the length of the input messages (represented by used_token_count) being subtracted from max_tokens?
Thank you for helping!
In Ollama, the definition of max_tokens is indeed different from other providers'. BTW, you could star the project to follow it. Thanks!
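For context (a minimal sketch, not the actual RAGFlow code): in Ollama's option set, num_predict caps only the generated tokens, while num_ctx sets the whole context window (prompt plus output), so a provider-style max_tokens has to be mapped onto num_predict explicitly. The helper name below is hypothetical:

    # Minimal sketch, not RAGFlow code: mapping a provider-style gen_conf
    # onto Ollama options. In Ollama, "num_predict" caps only the generated
    # tokens, while "num_ctx" sets the whole context window (prompt + output).
    # The helper name is hypothetical.
    def to_ollama_options(gen_conf: dict, context_window: int) -> dict:
        options = {"num_ctx": context_window}
        if "max_tokens" in gen_conf:
            # an output-only cap, not a shared input+output budget
            options["num_predict"] = gen_conf["max_tokens"]
        return options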
max_tokens is the context length of the given LLM, for example 16K. gen_conf["max_tokens"] is the output-length limit for a single round of chat, for example 512.
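In other words, message_fit_in trims the input to the context window, and the subtraction leaves whatever room remains for the output. A minimal sketch of that arithmetic, with illustrative numbers echoing the example above:

    # Illustrative numbers only: a 16K-context model whose prompt nearly fills it.
    context_window = 16384      # max_tokens: the LLM's total context length
    used_token_count = 16000    # tokens already consumed by the input messages
    requested_output = 512      # gen_conf["max_tokens"]: per-round output limit

    # Shrink the output cap so prompt + completion still fit in the window.
    effective_cap = min(requested_output, context_window - used_token_count)
    print(effective_cap)  # 384 -- only 384 tokens of room remain for the output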