
RuntimeError: PyTorch is not linked with support for xpu devices

openvino-book opened this issue 2 years ago · 18 comments

RuntimeError: PyTorch is not linked with support for xpu devices

I installed the BigDL GPU version on Windows 11 following https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html.

When I execute the code below (the model is chatglm3-6b):

import torch
import time
import argparse
import numpy as np

from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

# you could tune the prompt based on your own model,
# here the prompt tuning refers to https://github.com/THUDM/ChatGLM3/blob/main/PROMPT.md
CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM3 model')
    parser.add_argument('--repo-id-or-model-path', type=str, default="d:/chatglm3-6b",
                        help='The huggingface repo id for the ChatGLM3 model to be downloaded'
                             ', or the path to the huggingface checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么?",
                        help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32,
                        help='Max tokens to predict')

    args = parser.parse_args()
    model_path = args.repo_id_or_model_path

    # Load the model in 4-bit,
    # which converts the relevant layers in the model into INT4 format
    model = AutoModel.from_pretrained(model_path,
                                      load_in_4bit=True,
                                      trust_remote_code=True)
    
    model.save_low_bit("bigdl_chatglm3-6b-q4_0.bin")
    # run the optimized model on Intel GPU
    model = model.to('xpu')

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                              trust_remote_code=True)
    
    # Generate predicted tokens
    with torch.inference_mode():
        prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt=args.prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        st = time.time()
        # if your selected model is capable of utilizing previous key/value attentions
        # to enhance decoding speed, but has `"use_cache": false` in its model config,
        # it is important to set `use_cache=True` explicitly in the `generate` function
        # to obtain optimal performance with BigDL-LLM INT4 optimizations
        output = model.generate(input_ids,
                                max_new_tokens=args.n_predict)
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f'Inference time: {end-st} s')
        print('-'*20, 'Prompt', '-'*20)
        print(prompt)
        print('-'*20, 'Output', '-'*20)
        print(output_str)

the following error occurs:

Does BigDL support running ChatGLM3-6B on Arc GPUs right now?

openvino-book avatar Dec 23 '23 04:12 openvino-book

RuntimeError: PyTorch is not linked with support for xpu devices

It seems the installed PyTorch does not support XPU. Can you share the specific PyTorch version installed, and check whether it works with the Arc GPU (even without BigDL)?
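
For example, a quick sanity check (just a sketch; it assumes intel_extension_for_pytorch is installed):

import torch
import intel_extension_for_pytorch as ipex

print(torch.__version__)               # installed PyTorch version
print(ipex.__version__)                # installed IPEX version
print(torch.xpu.is_available())        # should print True if the Arc GPU is usable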

Does BigDL support running ChatGLM3-6B on Arc GPUs right now?

Yes, it supports ChatGLM3-6B on Arc GPU

jason-dai avatar Dec 23 '23 08:12 jason-dai

Add import intel_extension_for_pytorch as ipex before .to('xpu'); although we don't use ipex directly, it is still required to run on the GPU.
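
For example, the top of the script would look roughly like this (only the relevant lines shown):

import torch
import intel_extension_for_pytorch as ipex  # not used directly, but required so PyTorch can see the 'xpu' device
from bigdl.llm.transformers import AutoModel

# ... load the model as before, then:
model = model.to('xpu')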

MeouSker77 avatar Dec 25 '23 01:12 MeouSker77

and also add input_ids = input_ids.to('xpu')

MeouSker77 avatar Dec 25 '23 01:12 MeouSker77

and also add input_ids = input_ids.to('xpu')

Thank you, @MeouSker77, it works and solves the RuntimeError: PyTorch is not linked with support for xpu devices.

The code is modified as below:

import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex

CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"

# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# run the optimized model on Intel GPU
model = model.to('xpu')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the prompt in ChatGLM3 format
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu')
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids,max_new_tokens=32)
end = time.time()
# Decode and print the generated tokens
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)

However, another runtime error occurs: RuntimeError: The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)

(llm_gpu) D:\>python chatglm3_infer_gpu.py
C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: 'Could not find module 'C:\Users\OV\anaconda3\envs\llm_gpu\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.57it/s]
2023-12-26 09:55:56,907 - INFO - Converting the current model to sym_int4 format......
Traceback (most recent call last):
  File "D:\chatglm3_infer_gpu.py", line 29, in <module>
    output = model.generate(input_ids,max_new_tokens=32)
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\transformers\generation\utils.py", line 1538, in generate
    return self.greedy_search(
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\transformers\generation\utils.py", line 2362, in greedy_search
    outputs = self(
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 152, in chatglm2_model_forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 542, in forward
    layernorm_output = self.input_layernorm(hidden_states)
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\llm_gpu\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 83, in chatglm_rms_norm_forward
    result = linear_q4_0.fused_rms_norm(hidden_states,
RuntimeError: The number of work-items in each dimension of a work-group cannot exceed {512, 512, 512} for this device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)

Could you tell me how to solve this so that ChatGLM3-6B runs on the A770? Thank you very much in advance!

openvino-book avatar Dec 26 '23 02:12 openvino-book

Can you try this code to get the device name of 'xpu:0'?

name = torch.xpu.get_device_name(0)
print(name)

I'm afraid the default xpu device is not A770
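
To check which index the A770 has, you can also list all visible XPU devices with something like this (a small sketch; it assumes ipex is imported so torch.xpu is available):

import torch
import intel_extension_for_pytorch as ipex

# print the index and name of every XPU device PyTorch can see
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_name(i))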

MeouSker77 avatar Dec 26 '23 02:12 MeouSker77

Can you try this code to get the device name of 'xpu:0'?

name = torch.xpu.get_device_name(0)
print(name)

I'm afraid the default xpu device is not A770

When I run the code, I get AttributeError: module 'torch' has no attribute 'xpu'.

openvino-book avatar Dec 26 '23 14:12 openvino-book

Add import intel_extension_for_pytorch as ipex?

jason-dai avatar Dec 26 '23 14:12 jason-dai

print(name)


openvino-book avatar Dec 27 '23 02:12 openvino-book

How do I set the device to the Iris Xe or the Arc A770?

openvino-book avatar Dec 27 '23 02:12 openvino-book

How do I set the device to the Iris Xe or the Arc A770?

Change all .to('xpu') to .to('xpu:1') to use the A770.

MeouSker77 avatar Dec 27 '23 02:12 MeouSker77

After changing to 'xpu:1', I ran the code below:

import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex

CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"

# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# run the optimized model on Intel GPU
model = model.to('xpu:1')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the prompt in ChatGLM3 format
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu:1')
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids,max_new_tokens=32)
end = time.time()
# Decode and print the generated tokens
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)

However, a new error occurs: RuntimeError: could not create a primitive

(bigdl) D:\>python chatglm3_infer_gpu.py
C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torchvision\io\image.py:13: UserWarning: Failed to load image Python extension: 'Could not find module 'C:\Users\OV\anaconda3\envs\bigdl\Lib\site-packages\torchvision\image.pyd' (or one of its dependencies). Try using the full path with constructor syntax.' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
Loading checkpoint shards: 100%|███████████████████████████████| 7/7 [00:04<00:00, 1.53it/s]
2023-12-27 10:32:52,956 - INFO - Converting the current model to sym_int4 format......
onednn_verbose,info,oneDNN v3.3.0 (commit 887fb044ccd6308ed1780a3863c2c6f5772c94b3)
onednn_verbose,info,cpu,runtime:threadpool,nthr:10
onednn_verbose,info,cpu,isa:Intel AVX2 with Intel DL Boost
onednn_verbose,info,gpu,runtime:DPC++
onednn_verbose,info,gpu,engine,0,backend:Level Zero,name:Intel(R) Iris(R) Xe Graphics,driver_version:1.3.26957,binary_kernels:enabled
onednn_verbose,info,gpu,engine,1,backend:Level Zero,name:Intel(R) Arc(TM) A770M Graphics,driver_version:1.3.26957,binary_kernels:enabled
onednn_verbose,info,graph,backend,0:dnnl_backend
onednn_verbose,info,experimental features are enabled
onednn_verbose,info,use batch_normalization stats one pass is enabled
onednn_verbose,primitive,info,template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,graph,info,template:operation,engine,partition_id,partition_kind,op_names,data_formats,logical_tensors,fpmath_mode,backend,exec_time
onednn_verbose,common,error,level_zero,errcode 1879048196
Traceback (most recent call last):
  File "D:\chatglm3_infer_gpu.py", line 29, in <module>
    output = model.generate(input_ids,max_new_tokens=32)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\transformers\generation\utils.py", line 1538, in generate
    return self.greedy_search(
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\transformers\generation\utils.py", line 2362, in greedy_search
    outputs = self(
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 937, in forward
    transformer_outputs = self.transformer(
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 152, in chatglm2_model_forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 640, in forward
    layer_ret = layer(
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV/.cache\huggingface\modules\transformers_modules\chatglm3-6b\modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 353, in chatglm2_attention_forward_8eb45c
    context_layer = self.core_attention(query_layer, key_layer, value_layer, attention_mask)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\OV\anaconda3\envs\bigdl\lib\site-packages\bigdl\llm\transformers\models\chatglm2.py", line 369, in core_attn_forward_8eb45c
    context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer,
RuntimeError: could not create a primitive

openvino-book avatar Dec 27 '23 02:12 openvino-book

Sorry, on our Windows A770 machines the A770 is always the default xpu device, so we cannot reproduce this error.

You can change 'xpu:1' back to 'xpu' and add optimize_model=False in from_pretrained to run it on the iGPU.

Or you can change 'xpu:1' back to 'xpu' and run set ONEAPI_DEVICE_SELECTOR=level_zero:1 before running this example; that should make the A770 the default device.
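
(An untested sketch on my side: setting the variable from Python at the very top of the script, before torch and ipex are imported, may also work, since the selector is read when the runtime initializes.)

import os
os.environ["ONEAPI_DEVICE_SELECTOR"] = "level_zero:1"  # assumption: must be set before torch / ipex are imported

import torch
import intel_extension_for_pytorch as ipex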

MeouSker77 avatar Dec 27 '23 07:12 MeouSker77

Sorry, on our Windows A770 machines the A770 is always the default xpu device, so we cannot reproduce this error.

You can change 'xpu:1' back to 'xpu' and add optimize_model=False in from_pretrained to run it on the iGPU.

Or you can change 'xpu:1' back to 'xpu' and run set ONEAPI_DEVICE_SELECTOR=level_zero:1 before running this example; that should make the A770 the default device.

Maybe we should test on a laptop, because the A770M is a laptop GPU. I'll see whether I can reproduce this error on a laptop.

JinBridger avatar Dec 27 '23 08:12 JinBridger

Sorry, on our Windows A770 machines the A770 is always the default xpu device, so we cannot reproduce this error. You can change 'xpu:1' back to 'xpu' and add optimize_model=False in from_pretrained to run it on the iGPU. Or you can change 'xpu:1' back to 'xpu' and run set ONEAPI_DEVICE_SELECTOR=level_zero:1 before running this example; that should make the A770 the default device.

Maybe we should test on a laptop, because the A770M is a laptop GPU. I'll see whether I can reproduce this error on a laptop.

My machine is a NUC12 蝰蛇峡谷 (Serpent Canyon): i7-12700H + Arc A770M.

I changed 'xpu:1' back to 'xpu' and set ONEAPI_DEVICE_SELECTOR=level_zero:1 -- it works! Thank you very much!

Running the code:

import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex
import torch

CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"

# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"

# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# run the optimized model on Intel GPU
model = model.to('xpu')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the prompt in ChatGLM3 format
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu')
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids,max_new_tokens=32)
end = time.time()
# Decode and print the generated tokens
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)


openvino-book avatar Dec 28 '23 01:12 openvino-book

@JinBridger Could I ask one more question? I want to run chatglm3-6b on the A770 with Streamlit. model = model.to('xpu') can be added in get_model(), but how do I add input_ids = input_ids.to('xpu')?

The complete code is attached below:

import streamlit as st
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex
import torch

# Set the page title, icon, and layout
st.set_page_config(
    page_title="ChatGLM3-6B+BigDL-LLM演示",
    page_icon=":robot:",
    layout="wide"
)
# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"

@st.cache_resource
def get_model():
    # Load the ChatGLM3-6B model with INT4 quantization
    model = AutoModel.from_pretrained(model_path,
                                    load_in_4bit=True,
                                    trust_remote_code=True)
    model = model.to('xpu')
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                            trust_remote_code=True)
    return tokenizer, model

# Load the ChatGLM3 model and tokenizer
tokenizer, model = get_model()

# Initialize the chat history and past key values
if "history" not in st.session_state:
    st.session_state.history = []
if "past_key_values" not in st.session_state:
    st.session_state.past_key_values = None

# Configure max_length, top_p, and temperature
max_length = st.sidebar.slider("max_length", 0, 32768, 8192, step=1)
top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01)
temperature = st.sidebar.slider("temperature", 0.0, 1.0, 0.6, step=0.01)

# Button to clear the chat history
buttonClean = st.sidebar.button("清理会话历史", key="clean")
if buttonClean:
    st.session_state.history = []
    st.session_state.past_key_values = None
    st.rerun()

# Render the chat history
for i, message in enumerate(st.session_state.history):
    if message["role"] == "user":
        with st.chat_message(name="user", avatar="user"):
            st.markdown(message["content"])
    else:
        with st.chat_message(name="assistant", avatar="assistant"):
            st.markdown(message["content"])

# Input and output placeholders
with st.chat_message(name="user", avatar="user"):
    input_placeholder = st.empty()
with st.chat_message(name="assistant", avatar="assistant"):
    message_placeholder = st.empty()

# Get the user input
prompt_text = st.chat_input("请输入您的问题")

# If the user entered something, generate a reply
if prompt_text:

    input_placeholder.markdown(prompt_text)
    history = st.session_state.history
    past_key_values = st.session_state.past_key_values
    for response, history, past_key_values in model.stream_chat(
        tokenizer,
        prompt_text,
        history,
        past_key_values=past_key_values,
        max_length=max_length,
        top_p=top_p,
        temperature=temperature,
        return_past_key_values=True,
    ):
        message_placeholder.markdown(response)

    # Update the chat history and past key values
    st.session_state.history = history
    st.session_state.past_key_values = past_key_values

openvino-book avatar Dec 28 '23 01:12 openvino-book

Don't worry, the stream_chat API moves the input tokens to the model's device automatically (here), so you just need to move the model to xpu.
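
(For reference, if you were calling generate() yourself instead of stream_chat, you would still move the inputs manually, e.g.:)

input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu')   # move inputs to the same device as the model
output = model.generate(input_ids, max_new_tokens=32)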

MeouSker77 avatar Dec 28 '23 01:12 MeouSker77

Don't worry, the stream_chat API moves the input tokens to the model's device automatically (here), so you just need to move the model to xpu.

Yes! Thank you very much for the guidance! It works!

Tested sample code:

import streamlit as st
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex
import torch

# Set the page title, icon, and layout
st.set_page_config(
    page_title="ChatGLM3-6B+BigDL-LLM演示",
    page_icon=":robot:",
    layout="wide"
)
# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"

@st.cache_resource
def get_model():
    # Load the ChatGLM3-6B model with INT4 quantization
    model = AutoModel.from_pretrained(model_path,
                                    load_in_4bit=True,
                                    trust_remote_code=True)
    model = model.to('xpu')
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_path,
                                            trust_remote_code=True)
    return tokenizer, model

# Load the ChatGLM3 model and tokenizer
tokenizer, model = get_model()

# Initialize the chat history and past key values
if "history" not in st.session_state:
    st.session_state.history = []
if "past_key_values" not in st.session_state:
    st.session_state.past_key_values = None

# Configure max_length, top_p, and temperature
max_length = st.sidebar.slider("max_length", 0, 32768, 8192, step=1)
top_p = st.sidebar.slider("top_p", 0.0, 1.0, 0.8, step=0.01)
temperature = st.sidebar.slider("temperature", 0.0, 1.0, 0.6, step=0.01)

# Button to clear the chat history
buttonClean = st.sidebar.button("清理会话历史", key="clean")
if buttonClean:
    st.session_state.history = []
    st.session_state.past_key_values = None
    st.rerun()

# Render the chat history
for i, message in enumerate(st.session_state.history):
    if message["role"] == "user":
        with st.chat_message(name="user", avatar="user"):
            st.markdown(message["content"])
    else:
        with st.chat_message(name="assistant", avatar="assistant"):
            st.markdown(message["content"])

# Input and output placeholders
with st.chat_message(name="user", avatar="user"):
    input_placeholder = st.empty()
with st.chat_message(name="assistant", avatar="assistant"):
    message_placeholder = st.empty()

# Get the user input
prompt_text = st.chat_input("请输入您的问题")

# If the user entered something, generate a reply
if prompt_text:

    input_placeholder.markdown(prompt_text)
    history = st.session_state.history
    past_key_values = st.session_state.past_key_values
    for response, history, past_key_values in model.stream_chat(
        tokenizer,
        prompt_text,
        history,
        past_key_values=past_key_values,
        max_length=max_length,
        top_p=top_p,
        temperature=temperature,
        return_past_key_values=True,
    ):
        message_placeholder.markdown(response)

    # Update the chat history and past key values
    st.session_state.history = history
    st.session_state.past_key_values = past_key_values


openvino-book avatar Dec 29 '23 01:12 openvino-book