Starting from bigdl-llm[xpu]==2.5.0b20231227, the 32-32 output of GGUF models on Arc 770 contains <unk>
Issue description
While benchmarking GGUF models on Arc 770, starting from bigdl-llm[xpu]==2.5.0b20231227, the 32-32 output contains <unk> tokens.
To reproduce
We are using all-in-one/run.py to test it:
https://github.com/intel-analytics/BigDL/blob/main/python/llm/dev/benchmark/all-in-one/run.py
We update run_transformer_int4_gpu() in run.py to use from_gguf() to load the model:
def run_transformer_int4_gpu(repo_id,
                             local_model_hub,
                             in_out_pairs,
                             warm_up,
                             num_trials,
                             num_beams,
                             low_bit):
    from bigdl.llm.transformers import AutoModel, AutoModelForCausalLM
    from transformers import AutoTokenizer, GPTJForCausalLM, LlamaTokenizer
    import intel_extension_for_pytorch as ipex
    reserved_mem_list = []
    # model_path = get_model_path(repo_id, local_model_hub)
    model_path = "/mnt/disk1/models/gguf/llama-2-7b-chat.Q4_0.gguf"
    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    st = time.perf_counter()
    model, tokenizer = AutoModelForCausalLM.from_gguf(model_path)
    model = model.to('xpu')
    end = time.perf_counter()
    print(">> loading of model costs {}s".format(end - st))
    reserved_mem_list.append(torch.xpu.memory.memory_reserved() / (1024 ** 3))

    model = BenchmarkWrapper(model)

    result = {}
    with torch.inference_mode():
        for in_out in in_out_pairs:
            in_out_len = in_out.split("-")
            in_len = int(in_out_len[0])
            out_len = int(in_out_len[1])
            # As different tokenizers have different encodings,
            # in_len.txt may be shorter than we need;
            # use a much longer context to make sure the input is long enough
            test_length = min(in_len * 2, 8192)
            while test_length not in [32, 256, 1024, 2048, 8192]:
                test_length = test_length * 2
            input_str = open(f"prompt/{test_length}.txt", 'r').read()
            input_ids = tokenizer.encode(input_str, return_tensors="pt")
            input_ids = input_ids[:, :in_len]
            true_str = tokenizer.batch_decode(input_ids)[0]
            input_ids = tokenizer.encode(true_str, return_tensors="pt").to('xpu')
            actual_in_len = input_ids.shape[1]
            result[in_out] = []
            thread = threading.Thread(target=run_model_in_thread,
                                      args=(model, in_out, tokenizer, result, warm_up,
                                            num_beams, input_ids, out_len, actual_in_len,
                                            num_trials, reserved_mem_list))
            thread.start()
            thread.join()

    model.to('cpu')
    torch.xpu.synchronize()
    torch.xpu.empty_cache()
    del model
    gc.collect()
    return result
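For context, the patched function is invoked by run.py's driver code roughly as below; the argument values here are placeholders only (repo_id and local_model_hub are effectively ignored because the GGUF path is hard-coded above):

# Illustrative call only; values are placeholders, not the actual config.yaml settings.
result = run_transformer_int4_gpu(
    repo_id="llama-2-7b-chat",   # ignored: model_path is hard-coded above
    local_model_hub=None,        # ignored for the same reason
    in_out_pairs=["32-32"],      # the input/output length pair that produces <unk>
    warm_up=1,
    num_trials=3,
    num_beams=1,
    low_bit="sym_int4",
)
print(result["32-32"])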
This bug is caused by using the XMX kernel in a new thread; it does not happen when the model runs in the current thread. I think the root cause is a bug in oneAPI 2023.2, or something related to the machine or driver. With IPEX 2.1 and oneAPI 2024.0, the bug does not occur. Before the 1227 version, a 32-token input did not use the XMX kernel, so the bug did not show up.
Shall we just stop starting new threads in the benchmark script, or disable XMX? @jason-dai
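For the first option, a minimal sketch (based on the snippet above) would be to replace the three threading lines inside the for loop with a direct call, so generation runs in the thread that set up the XPU context; run_model_in_thread is the helper already defined in all-in-one/run.py:

# Workaround sketch: call the benchmark helper directly instead of spawning a new thread,
# so the XMX kernel is launched from the current (main) thread.
result[in_out] = []
run_model_in_thread(model, in_out, tokenizer, result, warm_up, num_beams,
                    input_ids, out_len, actual_in_len, num_trials, reserved_mem_list)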
We plan to switch to PyTorch 2.1/oneAPI 2024.0 as the default; maybe we should document this as a known issue, and use PyTorch 2.1/oneAPI 2024.0 in our nightly/development builds.
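As a quick environment check (assuming IPEX is installed and the oneAPI environment has been sourced via setvars.sh), the active PyTorch/IPEX build can be printed as follows; an IPEX 2.1.x version is the line that pairs with oneAPI 2024.0:

# Print the active PyTorch and Intel Extension for PyTorch versions.
import torch
import intel_extension_for_pytorch as ipex

print("torch:", torch.__version__)
print("ipex:", ipex.__version__)
print("xpu available:", torch.xpu.is_available())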