[Question]: Why is max_tokens being used for both input and output
Describe your problem
I was looking at the code of the chat() function in api/db/services/dialog_service.
I noticed that max_tokens is used to limit the size of the input sent to the LLM; that check is done in message_fit_in. But then this code runs right after message_fit_in:
if "max_tokens" in gen_conf:
gen_conf["max_tokens"] = min(
gen_conf["max_tokens"],
max_tokens - used_token_count)
And this gen_conf["max_tokens"] is later used in rag/llm/chat_model.py, inside the chat() function of the OllamaChat class:
if "max_tokens" in gen_conf: options["num_predict"] = gen_conf["max_tokens"]
This implies that max_tokens is now being used to limit the output size instead. If that is the case, why is the length of the input messages (represented by used_token_count) being subtracted from max_tokens?
Thank you for helping!
In Ollama, the definition of max_tokens is indeed different from other providers'. BTW, you could star the project to follow it. Thanks!
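For context (a minimal sketch, not the actual RAGFlow code): in Ollama's option set, num_predict caps only the generated tokens, while num_ctx sets the whole context window (prompt plus output), so a provider-style max_tokens has to be mapped onto num_predict explicitly. The helper name below is hypothetical:

    # Minimal sketch, not RAGFlow code: mapping a provider-style gen_conf
    # onto Ollama options. In Ollama, "num_predict" caps only the generated
    # tokens, while "num_ctx" sets the whole context window (prompt + output).
    # The helper name is hypothetical.
    def to_ollama_options(gen_conf: dict, context_window: int) -> dict:
        options = {"num_ctx": context_window}
        if "max_tokens" in gen_conf:
            # an output-only cap, not a shared input+output budget
            options["num_predict"] = gen_conf["max_tokens"]
        return options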
max_tokens is the context length of the given LLM, for example 16K. gen_conf["max_tokens"] is the output-length limit for a single round of chat, for example 512.
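In other words, message_fit_in trims the input to the context window, and the subtraction leaves whatever room remains for the output. A minimal sketch of that arithmetic, with illustrative numbers echoing the example above:

    # Illustrative numbers only: a 16K-context model whose prompt nearly fills it.
    context_window = 16384      # max_tokens: the LLM's total context length
    used_token_count = 16000    # tokens already consumed by the input messages
    requested_output = 512      # gen_conf["max_tokens"]: per-round output limit

    # Shrink the output cap so prompt + completion still fit in the window.
    effective_cap = min(requested_output, context_window - used_token_count)
    print(effective_cap)  # 384 -- only 384 tokens of room remain for the output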