ipex-llm
Baichuan2-13B with bigdl-bf16 does not apply greedy_search when calling model.generate
This is a bigdl-bf16 model, where `model_path` points to a Baichuan2-13B-Chat checkpoint:

```python
# load
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.bfloat16,
                                             load_in_low_bit="bf16",
                                             trust_remote_code=True,
                                             use_cache=True)

# inference
original_output = model.generate(input_ids=input_ids,
                                 use_cache=False,
                                 max_new_tokens=args.n_predict,
                                 do_sample=False)
```
It is found that `greedy_search` is not called as expected when using the `model.generate` API. By contrast, bigdl-int4 calls `greedy_search` as expected with the same style of API:
```python
draft_model = AutoModel.from_pretrained(model_path,
                                        load_in_4bit=True,
                                        optimize_model=True,
                                        trust_remote_code=True,
                                        use_cache=True)

draft_output = draft_model.generate(input_ids=input_ids,
                                    use_cache=True,
                                    max_new_tokens=args.n_predict,
                                    do_sample=False)
```
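Both calls pass `do_sample=False` and leave `num_beams` at its default of 1, which is the configuration under which transformers' `generate` is expected to dispatch to greedy search. A simplified sketch of that dispatch rule (a hypothetical helper for illustration, not the library's actual code):

```python
def expected_decoding_mode(do_sample: bool = False,
                           num_beams: int = 1,
                           num_beam_groups: int = 1) -> str:
    """Mirror the dispatch rule transformers' generate() applies
    (simplified; the real logic also checks constraints, logits
    processors, assistant models, etc.)."""
    if num_beam_groups > 1:
        return "group_beam_search"
    if num_beams > 1:
        return "beam_sample" if do_sample else "beam_search"
    return "sample" if do_sample else "greedy_search"

# The calls in this issue use do_sample=False with default num_beams=1,
# so greedy_search is the expected path:
print(expected_decoding_mode(do_sample=False))  # greedy_search
```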
This ultimately affects the outputs: bigdl-int4 and ipex-bf16 both use `greedy_search` and therefore give closer answers, while bigdl-bf16 shows a difference:
| dtype | Output Tokens |
|---|---|
| bigdl-int4 | 1. 在创建栅格折痕的过程中,折叠不对成可能会导致整个折纸过程失败。这种错误可能会像蝴蝶效应一样,导致最终折叠出的玫瑰花形状不准确或无法成形。2. 在制作立体基座的过程中,高台的布局不仅要考虑长和宽这两个维度上的规整衬度和对称分布,还需要同时保证高这个维度上的整齐。如果高台布局不符合要求,可能会导致玫瑰花形状不准确或无法成形。3. 在完成花瓣修饰的阶段,如果花瓣形状没有接近自然中的玫瑰花瓣外形,可能会导致最终折叠出的玫瑰花不够逼真。此外,如果调整花瓣形状的力道控制不当, |
| ipex-bf16 | 折纸失败的原因可能有很多,但根据上述描述,以下几个步骤如果做错了很大可能会导致最终折叠失败:1. 创建栅格折痕:如果折叠过程中出现了折叠不对成的情况,可能会导致折纸失败。这种情况可能会像蝴蝶效应一样,一开始只是毫厘之差,最后可能就是天壤之别。2. 制作立体基座:在这个阶段,如果高台的布局没有考虑长和宽这两个维度上的规整衬度和对称分布,以及高这个维度上的整齐,可能会导致折纸失败。3. 完成花瓣修饰:在这个阶段,如果花瓣的形状没有通过自然的曲线去不断 |
| bigdl-bf16 | 首先,在创建栅格折痕的过程中,如果出现折叠不对成的情况,可能会导致最终的折叠失败。这是因为折叠不对成可能会影响到后续的立体基座制作,甚至可能导致整个折纸过程的混乱。其次,在制作立体基座的过程中,如果高台的布局没有考虑到长、宽、高三个维度上的整齐和对称分布,也可能会导致最终的折叠失败。这是因为高台的布局直接影响到花瓣的形状和排列,从而影响整个玫瑰花的形状。最后,在完成花瓣修饰的阶段,如果没有充分理解大自然中玫瑰花的外形,并借助自然的曲线去不断修正花瓣的形状,也可能导致最终的折叠失败。这是因为花瓣的形状直接 |
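One generic way to confirm which decoding path `generate` actually takes is to wrap the suspected method on the model instance and record invocations. A debugging sketch under the assumption that the method of interest is called `greedy_search`; the `DummyModel` class is a stand-in for the real model:

```python
import functools

def trace_calls(obj, method_name):
    """Wrap obj.<method_name> so every invocation is appended to the
    returned list; behavior of the wrapped method is unchanged."""
    original = getattr(obj, method_name)
    calls = []

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        calls.append(method_name)
        return original(*args, **kwargs)

    setattr(obj, method_name, wrapper)
    return calls

# Dummy stand-in: generate() dispatches to greedy_search when not sampling.
class DummyModel:
    def greedy_search(self, ids):
        return ids + [0]
    def generate(self, ids, do_sample=False):
        if not do_sample:
            return self.greedy_search(ids)
        raise NotImplementedError("sampling not modeled here")

model = DummyModel()
calls = trace_calls(model, "greedy_search")
model.generate([1, 2, 3], do_sample=False)
print(calls)  # ['greedy_search']
```

Applying the same wrapper to the real bigdl-bf16 model before calling `generate` would show directly whether `greedy_search` is reached.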
Hope the bigdl-bf16 service owner can help fix this, please.
I can't reproduce this issue. Based on my test, the code below does call `self.greedy_search`:
```python
import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# load
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             optimize_model=True,
                                             torch_dtype=torch.bfloat16,
                                             load_in_low_bit="bf16",
                                             trust_remote_code=True,
                                             use_cache=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

input_str = "tell me a story"
input_ids = tokenizer.encode(input_str, return_tensors="pt")

# inference
original_output = model.generate(input_ids=input_ids,
                                 use_cache=False,
                                 max_new_tokens=13,
                                 do_sample=False)
output_str = tokenizer.decode(original_output[0], skip_special_tokens=True)
print(original_output)
print(output_str)
```
Is `import intel_extension_for_pytorch as ipex` necessary? The import does some init work. @rnwang04
> Is `import intel_extension_for_pytorch as ipex` necessary? The import does some init work. @rnwang04
It's not necessary; I used ipex here because I validated in a GPU conda env. I have double-checked in a CPU conda env and confirmed that it does use greedy search. I also found that our bf16 gives the same output as native bf16 in the CPU env.
- our bf16

```
=================enter greedy search================
tensor([[83680,  1643,  1346,  3028,  1670,  1346,  1750,  1777,  1438,  1738,
         33105,    72,     5, 13602,  5920,  1346,  1750]])
tell me a story about a time when you were scared.
Once upon a time
```

- native bf16

```
=================enter greedy search================
tensor([[83680,  1643,  1346,  3028,  1670,  1346,  1750,  1777,  1438,  1738,
         33105,    72,     5, 13602,  5920,  1346,  1750]])
tell me a story about a time when you were scared.
Once upon a time
```
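Since greedy decoding is deterministic, exact token-level equality is the right check when comparing two runs. A quick sketch using the token ids copied from the logs above (plain Python lists here; with real tensors, `torch.equal(a, b)` performs the same check):

```python
# Token ids from the two runs above (our bf16 vs. native bf16).
ours = [83680, 1643, 1346, 3028, 1670, 1346, 1750, 1777, 1438, 1738,
        33105, 72, 5, 13602, 5920, 1346, 1750]
native = [83680, 1643, 1346, 3028, 1670, 1346, 1750, 1777, 1438, 1738,
          33105, 72, 5, 13602, 5920, 1346, 1750]

# Element-wise equality of the full sequences confirms identical outputs.
print("identical:", ours == native)  # identical: True
```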