
[Question]: Why is max_tokens being used for both input and output

PROGRAMMERHAO opened this issue 1 year ago

Describe your problem

I was looking at the code of the chat() function in api/db/services/dialog_service.

I noticed that max_tokens is being used to limit the size of the input sent to the LLM, with the check done in message_fit_in. But then this code follows message_fit_in:

 if "max_tokens" in gen_conf:
        gen_conf["max_tokens"] = min(
            gen_conf["max_tokens"],
            max_tokens - used_token_count)
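
For context, message_fit_in is the step that trims the prompt so it fits within the token budget, and used_token_count is what that trimmed prompt actually consumes. Here is a minimal sketch of that kind of input-fitting logic; the function name, signature, and return shape are assumptions for illustration, not ragflow's actual implementation:

    # Hedged sketch of an input-fitting step like message_fit_in;
    # ragflow's real function differs in its details.
    def message_fit_in_sketch(messages, max_length, count_tokens):
        # Drop the oldest non-system turns until the prompt fits.
        msgs = list(messages)
        while len(msgs) > 1 and \
                sum(count_tokens(m["content"]) for m in msgs) > max_length:
            msgs.pop(1)  # keep the system prompt at index 0
        used_token_count = sum(count_tokens(m["content"]) for m in msgs)
        return used_token_count, msgs

After this step, used_token_count is the size of the prompt that will actually be sent, which is why it is subtracted in the min(...) above.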

And this gen_conf["max_tokens"] is later used in rag/llm/chat_model.py, inside the chat() function of the OllamaChat class:

if "max_tokens" in gen_conf: options["num_predict"] = gen_conf["max_tokens"]

This implies that max_tokens is now being used to limit the output size instead. And if that is the case, why is the length of the input message (represented by used_token_count) being subtracted from max_tokens?

Thank you for helping!

PROGRAMMERHAO, Jul 07 '24 09:07

In Ollama, the definition of max_tokens is indeed different from other providers. BTW, you could star the project to follow it. Thanks!

KevinHuSh, Jul 08 '24 04:07

max_tokens is the context length of the given LLM, for example 16K. gen_conf["max_tokens"] is the output-length limit of a single chat round, for example 512.
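
Put concretely (16K and 512 are the examples above; used_token_count is an illustrative number):

    max_tokens = 16 * 1024          # context length of the model (16K)
    used_token_count = 3000         # tokens already consumed by the prompt
    gen_conf = {"max_tokens": 512}  # requested output cap for this round

    # The clamp from dialog_service keeps prompt + output within the context:
    gen_conf["max_tokens"] = min(gen_conf["max_tokens"],
                                 max_tokens - used_token_count)
    print(gen_conf["max_tokens"])   # 512, the smaller of 512 and 13384

So the subtraction reserves whatever output budget remains after the prompt: the model is never asked to generate more tokens than are left in its context window.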

yuzhichang, Nov 27 '24 13:11