Starting from bigdl-llm[xpu]==2.5.0b20231227, the 32-32 output of GGUF models on Arc 770 contains <unk>
Issue description
While benchmarking GGUF models on Arc 770, starting from bigdl-llm[xpu]==2.5.0b20231227, the 32-32 output contains <unk> tokens.
To reproduce
We are using all-in-one/run.py to test it:
https://github.com/intel-analytics/BigDL/blob/main/python/llm/dev/benchmark/all-in-one/run.py
We update run_transformer_int4_gpu() in run.py to use from_gguf() to load the model:
def run_transformer_int4_gpu(repo_id,
                             local_model_hub,
                             in_out_pairs,
                             warm_up,
                             num_trials,
                             num_beams,
                             low_bit):
    from bigdl.llm.transformers import AutoModel, AutoModelForCausalLM
    from transformers import AutoTokenizer, GPTJForCausalLM, LlamaTokenizer
    import intel_extension_for_pytorch as ipex
    reserved_mem_list = []
    # model_path = get_model_path(repo_id, local_model_hub)
    model_path = "/mnt/disk1/models/gguf/llama-2-7b-chat.Q4_0.gguf"
    # Load model in 4 bit,
    # which converts the relevant layers in the model into INT4 format
    st = time.perf_counter()
    model, tokenizer = AutoModelForCausalLM.from_gguf(model_path)
    model = model.to('xpu')
    end = time.perf_counter()
    print(">> loading of model costs {}s".format(end - st))
    reserved_mem_list.append(torch.xpu.memory.memory_reserved() / (1024 ** 3))

    model = BenchmarkWrapper(model)

    result = {}
    with torch.inference_mode():
        for in_out in in_out_pairs:
            in_out_len = in_out.split("-")
            in_len = int(in_out_len[0])
            out_len = int(in_out_len[1])
            # As different tokenizers have different encodings,
            # in_len.txt may be shorter than we need;
            # use a much longer context to make sure the input is long enough
            test_length = min(in_len * 2, 8192)
            while test_length not in [32, 256, 1024, 2048, 8192]:
                test_length = test_length * 2
            input_str = open(f"prompt/{test_length}.txt", 'r').read()
            input_ids = tokenizer.encode(input_str, return_tensors="pt")
            input_ids = input_ids[:, :in_len]
            true_str = tokenizer.batch_decode(input_ids)[0]
            input_ids = tokenizer.encode(true_str, return_tensors="pt").to('xpu')
            actual_in_len = input_ids.shape[1]
            result[in_out] = []
            thread = threading.Thread(target=run_model_in_thread,
                                      args=(model, in_out, tokenizer, result, warm_up,
                                            num_beams, input_ids, out_len, actual_in_len,
                                            num_trials, reserved_mem_list))
            thread.start()
            thread.join()

    model.to('cpu')
    torch.xpu.synchronize()
    torch.xpu.empty_cache()
    del model
    gc.collect()
    return result
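For context, the patched function is invoked by run.py's driver code roughly as below; the argument values here are placeholders only (repo_id and local_model_hub are effectively ignored because the GGUF path is hard-coded above):

# Illustrative call only; values are placeholders, not the actual config.yaml settings.
result = run_transformer_int4_gpu(
    repo_id="llama-2-7b-chat",   # ignored: model_path is hard-coded above
    local_model_hub=None,        # ignored for the same reason
    in_out_pairs=["32-32"],      # the input/output length pair that produces <unk>
    warm_up=1,
    num_trials=3,
    num_beams=1,
    low_bit="sym_int4",
)
print(result["32-32"])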
This bug is caused by using the XMX kernel in a new thread; it does not happen when the model runs in the current thread. I think the root cause is a bug in oneAPI 2023.2, or something related to the machine or driver. With IPEX 2.1 and oneAPI 2024.0, the bug does not occur. Before the 1227 version, a 32-token input did not use the XMX kernel, so the bug did not show up.
Shall we just stop starting new threads in the benchmark script, or disable XMX? @jason-dai
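For the first option, a minimal sketch (based on the snippet above) would be to replace the three threading lines inside the for loop with a direct call, so generation runs in the thread that set up the XPU context; run_model_in_thread is the helper already defined in all-in-one/run.py:

# Workaround sketch: call the benchmark helper directly instead of spawning a new thread,
# so the XMX kernel is launched from the current (main) thread.
result[in_out] = []
run_model_in_thread(model, in_out, tokenizer, result, warm_up, num_beams,
                    input_ids, out_len, actual_in_len, num_trials, reserved_mem_list)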
We plan to switch to PyTorch 2.1/oneAPI 2024.0 as the default; maybe we should document this as a known issue, and use PyTorch 2.1/oneAPI 2024.0 in our nightly/development builds.
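As a quick environment check (assuming IPEX is installed and the oneAPI environment has been sourced via setvars.sh), the active PyTorch/IPEX build can be printed as follows; an IPEX 2.1.x version is the line that pairs with oneAPI 2024.0:

# Print the active PyTorch and Intel Extension for PyTorch versions.
import torch
import intel_extension_for_pytorch as ipex

print("torch:", torch.__version__)
print("ipex:", ipex.__version__)
print("xpu available:", torch.xpu.is_available())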