prompt will always be truncated
When calling `/v1/chat/completions`, the server calls the function `check_length`, which computes `max_new_tokens = min(max_tokens, context_len - token_num)`, where `token_num = len(tokenizer(prompt).input_ids)`.
But `inference.py` then computes `max_src_len = context_len - max_new_tokens - 1`, and this leads to truncating the prompt every time.
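Substituting the first formula into the second shows the off-by-one holds for any prompt length whenever `max_tokens` does not bind: `max_src_len = context_len - (context_len - token_num) - 1 = token_num - 1`. A minimal sketch of that collapse (variable names follow the linked code; the values are illustrative):

```python
# Combine the two budget formulas from check_length() and inference.py.
# Whenever max_tokens does not bind, they collapse to token_num - 1,
# i.e. the prompt budget is always one token shorter than the prompt.
def compute_max_src_len(context_len: int, token_num: int, max_tokens: int) -> int:
    max_new_tokens = min(max_tokens, context_len - token_num)  # check_length
    return context_len - max_new_tokens - 1                    # inference.py

for token_num in (8, 100, 2000):
    assert compute_max_src_len(4096, token_num, 4096) == token_num - 1
```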
For example, with `context_len = 4096` and `token_num = len(tokenizer(prompt).input_ids) = 8`, we get `max_new_tokens = 4096 - 8 = 4088`,
and then `max_src_len = context_len - max_new_tokens - 1 = 4096 - 4088 - 1 = 7`.
So when `input_ids = input_ids[-max_src_len:]` truncates the prompt, the first token is dropped.
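A self-contained reproduction of the arithmetic above (a sketch, not the actual FastChat code; the token ids are made up and `max_tokens` is assumed not to bind):

```python
# Reproduce the budget math and the resulting truncation end to end.
context_len = 4096
max_tokens = 4096                    # request-side cap, assumed not to bind
input_ids = list(range(8))           # stand-in for tokenizer(prompt).input_ids
token_num = len(input_ids)           # 8

max_new_tokens = min(max_tokens, context_len - token_num)  # 4088 (check_length)
max_src_len = context_len - max_new_tokens - 1             # 7    (inference.py)

truncated = input_ids[-max_src_len:]  # keeps only the last 7 of 8 tokens
print(truncated)                      # [1, 2, 3, 4, 5, 6, 7] -> token 0 dropped
```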
All relevant links:
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py#L437
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/base_model_worker.py#L152
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/inference.py#L97
- https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/openai_api_server.py#L169