ipex-llm model inference issue

After completing the environment setup, I tried to run the chatglm2 code from the examples, but the output was extremely slow and the GPU was not fully utilized. It was much slower than running on the CPU.

The performance monitoring screenshot is as follows:

The terminal screenshot is as follows. The response below took roughly ten minutes to appear, yet the inference time is reported as 1.89 s.

The code is as follows:

#
# Copyright 2016 The BigDL Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import torch
import time
import argparse

from transformers import AutoModel, AutoTokenizer
from bigdl.llm import optimize_model

# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L1007
CHATGLM_V2_PROMPT_FORMAT = "问：{prompt}\n\n答："

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM2 model')
    # parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/chatglm2-6b",
    #                     help='The huggingface repo id for the ChatGLM2 model to be downloaded'
    #                          ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么？",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = r'D:\Code\chatglm2-6b'

    # Load model
    model = AutoModel.from_pretrained(model_path,
                                      trust_remote_code=True,
                                      torch_dtype='auto',
                                      low_cpu_mem_usage=True)

    # With only one line to enable BigDL-LLM optimization on model
    # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the optimize_model function.
    # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU.
    model = optimize_model(model)

    model = model.to('xpu')

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    
    # Generate predicted tokens
    with torch.inference_mode():
        prompt = CHATGLM_V2_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        # ipex model needs a warmup, then inference time can be accurate
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)

        # start inference
        st = time.time()
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        torch.xpu.synchronize()
        end = time.time()
        output = output.cpu()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Output', '-'*20)
        print(output_str)

Mar 22 '24 02:03 SJF-ECNU

I ran it once more, and this time the reported inference time is abnormally long.
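The script above times only the second `generate()` call, so model loading, `optimize_model`, the move to `'xpu'`, and the warm-up call are all outside the reported figure. One way to see where the wall-clock time actually goes is to time each phase separately. The following is a minimal sketch of that idea, not part of the original example: the `timed` helper and `fake_generate` stand-in are hypothetical names, and in the real script each `with` block would wrap the corresponding step (keeping `torch.xpu.synchronize()` inside the block before the clock stops).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def fake_generate(delay):
    """Hypothetical stand-in for model.generate(); it only sleeps."""
    time.sleep(delay)


timings = {}

# In the real script this block would wrap the first (warm-up) generate() call.
with timed("warmup", timings):
    fake_generate(0.05)

# In the real script this block would wrap the second generate() call,
# followed by torch.xpu.synchronize() so queued GPU work is included.
with timed("steady_state", timings):
    fake_generate(0.01)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

If the warm-up phase dominates on every run, that points at per-run startup cost rather than at token generation itself.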

Mar 22 '24 03:03 SJF-ECNU

Please try https://bigdl.readthedocs.io/en/latest/doc/LLM/Quickstart/benchmark_quickstart.html

Mar 22 '24 03:03 jason-dai

It looks like you forgot to set SYCL_CACHE_PERSISTENT=1; see https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration.

Mar 25 '24 02:03 qiuxin2012
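The linked runtime-configuration page describes setting this variable in the shell before launching Python (on Windows cmd, `set SYCL_CACHE_PERSISTENT=1`). As a rough sketch, the same variable can also be set at the very top of the script. This assumes the SYCL runtime reads it only when it initializes, so the assignment has to come before `torch` and `bigdl.llm` are imported; that ordering is an assumption here, not something confirmed in this thread.

```python
import os

# Assumption: this must run before torch / bigdl.llm are imported, so the
# SYCL runtime sees the variable when it starts up.
os.environ["SYCL_CACHE_PERSISTENT"] = "1"

# The imports from the original script would follow here, for example:
# import torch
# from bigdl.llm import optimize_model

print("SYCL_CACHE_PERSISTENT =", os.environ["SYCL_CACHE_PERSISTENT"])
```

Setting the variable in the shell, as the documentation describes, remains the safer option because it does not depend on import order.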