
Some models, such as chatglm3-6b, do not produce the expected results compared to the Hugging Face version.

Open EvilPsyCHo opened this issue 1 year ago • 5 comments

Hi team, I tested chatglm3-6b with vLLM but obtained poor results, because the prompt is not tokenized the way the model expects.

  1. First, I tested the Hugging Face 'model.chat' function.

from transformers import AutoModel, AutoConfig, AutoTokenizer
from torch import cuda, bfloat16
import transformers


model_id = '/data/models/ZhipuAI/chatglm3-6b'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'


# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# initialize the HF tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model_config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, config=model_config, quantization_config=bnb_config, device_map={"": device})
model.eval()
print(f"load model to {device}")

his = []
res, his = model.chat(tokenizer, "你是谁?", history=his)
print(res)

It produced the expected result:

我是一个名为 ChatGLM3-6B 的人工智能助手,是基于清华大学 KEG 实验室和智谱 AI 公司于 2023 年共同训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。
  2. Then I tested the Hugging Face 'model.generate' function directly.
# Unable to obtain the expected result when calling "generate" directly
input_ids = tokenizer("你是谁?", return_tensors="pt").input_ids.cuda()
output_ids = model.generate(input_ids)[0]
tokenizer.decode(output_ids.tolist())

It produced a poor result:

'[gMASK]sop 你是谁?你有什么目的?你有什么能力?\n \n\n我是一个'

  3. By digging into the 'tokenizer.build_chat_input' and 'model.chat' functions, I made some modifications to the input and the generation configuration.
eos_token_id = [tokenizer.eos_token_id, tokenizer.get_command("<|user|>"),
                tokenizer.get_command("<|observation|>")]
gen_kwargs = {"max_length": 8192, "num_beams": 1, "do_sample": True, "top_p": 0.8,
              "temperature": 0.8, "logits_processor": None, "eos_token_id": eos_token_id}
input_ids = tokenizer.build_chat_input("你是谁?", history=[], role="user").input_ids.cuda()
output_ids = model.generate(input_ids, **gen_kwargs)[0].tolist()[len(input_ids[0]):-1]
output = tokenizer.decode(output_ids)
print(output)

Now the result is the same as 'model.chat':

 我是一个名为 ChatGLM3-6B 的人工智能助手,是基于清华大学 KEG 实验室和智谱 AI 公司于 2023 年共同训练的语言模型开发的。我的任务是针对用户的问题和要求提供适当的答复和支持。
  4. Finally, I tested vLLM.
from vllm import LLM, SamplingParams

model_id = '/data/models/ZhipuAI/chatglm3-6b'
sampling = SamplingParams(use_beam_search=False, top_p=0.8, temperature=0.8)
model = LLM(model_id, dtype="bfloat16", trust_remote_code=True)
output = model.generate("你是谁?", sampling)
print(output[0].outputs[0].text)

It produced a result similar to the Hugging Face 'model.generate' call:

你有什么目的?你有什么能力?

But we can't achieve the desired behavior just by tweaking the prompt, because the tokenizer calls are integrated inside vLLM; fixing this would require modifying vLLM's own code.
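
To illustrate, here is a quick check (hypothetical inspection code continuing from the snippet above, assuming llm_engine.tokenizer exposes the underlying Hugging Face tokenizer):

# Continuing from the vLLM snippet above (hypothetical inspection code).
# Assumption: model.llm_engine.tokenizer is the underlying HF tokenizer.
hf_tokenizer = model.llm_engine.tokenizer
plain_ids = hf_tokenizer("你是谁?").input_ids
chat_ids = hf_tokenizer.build_chat_input("你是谁?", history=[], role="user").input_ids[0].tolist()
# What vLLM builds internally from a raw string vs. what model.chat would feed the model:
print("raw-string encoding:", plain_ids)
print("chat encoding:      ", chat_ids)  # carries the extra <|user|>/<|assistant|> command tokens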

So does the vLLM team have any plan to solve problems like this?

EvilPsyCHo avatar Dec 15 '23 03:12 EvilPsyCHo

Yes, I have the same issue. I speculate it might be due to differences in the algorithms between vLLM and HF. I wonder if the official team could address and fix this discrepancy in the output.

821484459 avatar Dec 15 '23 09:12 821484459

I guess the chat input (tokenizer.build_chat_input("你是谁?", history=[], role="user").input_ids.cuda()) is different from the plain input_ids (tokenizer("你是谁?", return_tensors="pt").input_ids.cuda()).

shiqingzhangCSU avatar Dec 19 '23 03:12 shiqingzhangCSU

Yes, build_chat_input constructs the special token "<|user|>" and concatenates it with tokenizer("你是谁?", return_tensors="pt").input_ids.

The way the chat prompt is constructed varies from model to model, so vLLM should either adapt to each model or provide a convenient, user-customizable interface.
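
A small inspection sketch (hypothetical code, reusing the same local checkpoint as above) makes the difference visible:

# Hypothetical inspection snippet: show which command tokens build_chat_input
# adds on top of a plain encoding (same local chatglm3-6b checkpoint as above).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('/data/models/ZhipuAI/chatglm3-6b', trust_remote_code=True)

plain_ids = tokenizer("你是谁?", return_tensors="pt").input_ids[0].tolist()
chat_ids = tokenizer.build_chat_input("你是谁?", history=[], role="user").input_ids[0].tolist()

print("<|user|> id:     ", tokenizer.get_command("<|user|>"))
print("<|assistant|> id:", tokenizer.get_command("<|assistant|>"))
# Tokens present only in the chat encoding, i.e. what a plain tokenizer call never produces
print("only in chat encoding:", [t for t in chat_ids if t not in plain_ids])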

EvilPsyCHo avatar Dec 19 '23 14:12 EvilPsyCHo

Does the vLLM team have any plan for this?

EvilPsyCHo avatar Dec 19 '23 14:12 EvilPsyCHo

You can do it like this:

from vllm import LLM, SamplingParams

llm = LLM(
    "/home/dev/model/chatglm3-6b/", tensor_parallel_size=1, trust_remote_code=True
)
# build the chat prompt with the model's own tokenizer and pass the token ids
# to vLLM directly, bypassing its plain-string tokenization
tokenizer = llm.llm_engine.tokenizer
input_ids = tokenizer.build_chat_input("你是谁?", history=[], role="user")[
    "input_ids"
].tolist()
sampling = SamplingParams(top_p=0.8, temperature=0.8)
output = llm.generate(prompt_token_ids=input_ids, sampling_params=sampling)
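
Each RequestOutput separates the prompt from the completions, so to print only the newly generated text (same pattern as earlier in this thread):

# .outputs[0].text holds only the generated continuation, not the prompt
print(output[0].outputs[0].text)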

shell-nlp avatar Jan 06 '24 15:01 shell-nlp

The generated output always includes the prompt; how can I strip it out? I even explicitly said that the prompt should not be included in the output.

PeterXiaTian avatar Jan 24 '24 09:01 PeterXiaTian