ipex-llm
Qwen-1.8B GPU memory usage is too high
https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary
Running the above model on an Arc A770, I get the following peak GPU memory numbers:

- 32 in / 32 out: 3.1 GB
- 2048 in / 512 out: 7.4 GB
- 4096 in / 1024 out: 11.6 GB
- 8192 in / 2048 out: OOM
This is much higher than the INT4 memory usage reported on the official page. Can it be optimized?

env: Linux 22.05, kernel 5.19, OneAPI 2024.0, bigdl 2.5.0b20231218, ipex 2.1.10+xpu

Reproduction: convert the Qwen model to low-bit int4, then run the benchmark script with different input lengths to test performance.

Partial test code:

```python
with torch.inference_mode():
    prompt = QWEN_PROMPT_FORMAT.format(prompt=prompt)
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
    # ipex model needs a warmup, then inference time can be accurate
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    torch.xpu.synchronize()

    for i in range(5):
        st = time.time()
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        output = model.generate(input_ids, do_sample=False,
                                max_new_tokens=args.n_predict)
        torch.xpu.synchronize()
        end = time.time()
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f"cost {end - st:.4f}s")
        print(output_str)
```
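The report above does not show how peak memory was measured. A minimal sketch of one way to capture it, assuming `intel_extension_for_pytorch` is installed so the `torch.xpu` memory-stats API (a mirror of `torch.cuda`'s) is available:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401, registers the torch.xpu backend

# Assumption: torch.xpu mirrors torch.cuda's memory-statistics API in recent IPEX releases.
torch.xpu.reset_peak_memory_stats()

# ... run the warmup and the timed generate() calls here ...

torch.xpu.synchronize()
peak_bytes = torch.xpu.max_memory_allocated()
print(f"peak GPU memory allocated: {peak_bytes / 1024**3:.2f} GB")
```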
Hi, on https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary it seems the reported memory is for 1 token in and 2048/8192 tokens out.
We will reproduce this result and update our results here.
We used https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py to reproduce the results.
Here are the results on Nvidia's GPU: the memory usage reported by `torch.cuda.max_memory_allocated` matches the official report, while the usage reported by `nvidia-smi` is a little larger.
| Model | Device | in-out | torch.cuda.max_memory_allocated | nvidia-smi |
|---|---|---|---|---|
| Qwen-1_8B-Chat-Int4 | RTX4090 | 1-2048 | 2.91 GB | 3.62 GB |
Here are the results on Intel's A770 using bigdl-llm: the memory usage is a little larger than the official report, but it is reasonable. We are continuing to optimize Qwen's memory footprint.
| Model | Device | in-out | torch.xpu.max_memory_allocated | xpu-smi |
|---|---|---|---|---|
| Qwen-1_8B-Chat | Arc 770 | 1-2048 | 3.34 GB | 4.01 GB |
For longer sequence inputs (1k/2k/4k, ...), bigdl-llm uses more memory than the official model. We will look into this.
Thanks for the update.
Hi, sorry for the late reply. One difference is that the official int4 model uses w4a16, while previously when you ran with ipex-llm it used w4a32, so you need to add `model = model.half()` after loading the model and before moving it to xpu.
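For reference, here is a minimal sketch of that change, assuming the ipex-llm transformers wrapper (older releases use the `bigdl.llm.transformers` package name) and a Hugging Face-style checkpoint path for Qwen-1.8B:

```python
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # older installs: from bigdl.llm.transformers import ...

model_path = "Qwen/Qwen-1_8B-Chat"  # assumed checkpoint path; point to your local model if needed

# Load the model with int4 weight quantization.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.half()   # run activations in fp16 (w4a16) instead of fp32 (w4a32)
model = model.to('xpu')

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```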
We have optimized our memory usage, and it is now comparable to the RTX 4090 numbers. Please have a check with the latest ipex-llm :)