ipex-llm
Running ChatGLM3-6B on A380 with BigDL, it hangs all the time
OS: Win10 22H2 19045.3803, Python 3.9; the environment was installed according to https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html
Test code:
import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
import intel_extension_for_pytorch as ipex
import torch
CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"
# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"
# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# Run the optimized model on the Intel GPU
model = model.to('xpu')
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the prompt in ChatGLM3 format
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")
# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu')
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids, max_new_tokens=32)
end = time.time()
# Decode and display the generated tokens
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)
Run the Python script with:
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
set SYCL_CACHE_PERSISTENT=1
python chatglm3_infer_gpu.py
The code hangs indefinitely, as shown below:
When I modify the code and run it on the CPU, it works! Test code:
import time
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer
CHATGLM_V3_PROMPT_FORMAT = "<|user|>\n{prompt}\n<|assistant|>"
# Specify the local path to chatglm3-6b
model_path = "d:/chatglm3-6b"
# Load the ChatGLM3-6B model with INT4 quantization
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)
# Build the prompt in ChatGLM3 format
prompt = CHATGLM_V3_PROMPT_FORMAT.format(prompt="What is Intel?")
# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
st = time.time()
# Run inference and generate tokens
output = model.generate(input_ids, max_new_tokens=32)
end = time.time()
# Decode and display the generated tokens
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(f'Inference time: {end-st} s')
print('-'*20, 'Prompt', '-'*20)
print(prompt)
print('-'*20, 'Output', '-'*20)
print(output_str)
The A380's 6GB memory is not enough to run chatglm3-6b right now.
You can try adding the parameter cpu_embedding=True to AutoModel.from_pretrained and try again on the A380. You may need to wait about 10-20 minutes.
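A minimal sketch of that change, applied to the GPU script above (only the model-loading part changes; everything else stays the same):

# Keep the embedding table on the CPU (cpu_embedding=True) so the INT4 model
# fits within the A380's 6GB of device memory, as suggested above.
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  cpu_embedding=True,
                                  trust_remote_code=True)
model = model.to('xpu')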
The A380's 6GB memory is not enough to run chatglm3-6b right now.
As the screenshot shows, after loading ChatGLM3-6B into the A380's memory, it reports 4.4GB of usage. Is the A380's 6GB of memory really not enough? Many low-power dGPUs like the A380 only have 6GB of memory, so supporting low-power dGPUs at the edge is important, e.g. for LLM + robot applications.
ChatGLM3-6B now runs successfully on the A380.
The test platform is below:
We have run it successfully on our A380, too.
Please make sure you have set SYCL_CACHE_PERSISTENT=1; otherwise the kernel compilation will take about 7 minutes on every run. With this environment variable set, the compilation only happens on the first run, and subsequent runs will be very fast.
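If you prefer to keep this setting inside the script instead of the shell, a minimal sketch follows (assumption: the variable takes effect as long as it is set before the SYCL runtime initializes, i.e. before importing intel_extension_for_pytorch or touching the XPU):

import os
# Assumption: setting SYCL_CACHE_PERSISTENT here is equivalent to
# `set SYCL_CACHE_PERSISTENT=1` in the shell, so compiled GPU kernels are
# cached and only the first run pays the ~7 minute compilation cost.
os.environ["SYCL_CACHE_PERSISTENT"] = "1"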