
qwen1.8B GPU memory usage is too high


https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary

Running the above model on an Arc A770 GPU, I got the following peak GPU memory usage:

    32 in / 32 out:       3.1 GB
    2048 in / 512 out:    7.4 GB
    4096 in / 1024 out:   11.6 GB
    8192 in / 2048 out:   OOM

This is a large gap compared to the INT4 memory usage reported on the official page; can it be optimized?

env: Linux 22.05, kernel 5.19, OneAPI 2024.0, bigdl 2.5.0b20231218, ipex: 2.1.10+xpu

How to reproduce: convert the Qwen model to low-bit INT4, then run the benchmark script with different input lengths to measure performance. Part of the test code:

    with torch.inference_mode():
        torch.xpu.synchronize()
        prompt = QWEN_PROMPT_FORMAT.format(prompt=prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        torch.xpu.synchronize()
        # ipex model needs a warmup, then inference time can be accurate
        output = model.generate(input_ids, max_new_tokens=args.n_predict)

        for i in range(5):
            st = time.time()
            torch.xpu.synchronize()
            input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
            output = model.generate(input_ids, do_sample=False, max_new_tokens=args.n_predict)
            output_str = tokenizer.decode(output[0], skip_special_tokens=True)
            torch.xpu.synchronize()
            end = time.time()
            print(f"cost {end - st:.4f}s")
            print(output_str)
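For reference, the low-bit INT4 conversion mentioned above is done with bigdl-llm's transformers-style loading API. A minimal sketch, assuming the bigdl 2.5.0b nightly listed in the env above; the model path is a placeholder (a local ModelScope download directory can be used instead):

    import torch
    from bigdl.llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    # Load the Qwen checkpoint and quantize its weights to INT4 on the fly.
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-1_8B-Chat",      # placeholder: replace with the local checkpoint path
        load_in_4bit=True,
        trust_remote_code=True,
    )
    model = model.to('xpu')         # move the low-bit model to the Arc GPU

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)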

juan-OY (Dec 29 '23)

Hi, on that page (https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary) it seems the reported memory is for 1 token in and 2048/8192 tokens out.

We will reproduce this result and update our results here.

hkvision (Jan 02 '24)

We used https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py to reproduce the results.

Here are the results on an NVIDIA GPU: the memory usage reported by torch.cuda.max_memory_allocated matches the official report, while the usage reported by nvidia-smi is a little larger.

Model                Device    in-out   torch.cuda.max_memory_allocated   nvidia-smi
Qwen-1_8B-Chat-Int4  RTX4090   1-2048   2.91 GB                           3.62 GB

Here are the results on Intel's A770 using bigdl-llm: the memory usage is a little larger than the official report, but it is reasonable. We are continuing to optimize Qwen's memory footprint.

Model            Device    in-out   torch.xpu.max_memory_allocated    xpu-smi
Qwen-1_8B-Chat   Arc 770   1-2048   3.34 GB                           4.01 GB
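For reference, the peak allocation in the table can be read back in Python right after generation. A minimal sketch, assuming `model` and `input_ids` are already prepared on the xpu device as in the snippet above, and that intel_extension_for_pytorch is installed so the torch.xpu memory API is available (on NVIDIA the analogous call is torch.cuda.max_memory_allocated):

    import torch
    import intel_extension_for_pytorch as ipex  # registers the torch.xpu namespace

    # Clear the peak-memory counter before the measured run.
    torch.xpu.reset_peak_memory_stats()

    output = model.generate(input_ids, max_new_tokens=2048)
    torch.xpu.synchronize()

    peak_bytes = torch.xpu.max_memory_allocated()
    print(f"peak GPU memory allocated: {peak_bytes / 1024**3:.2f} GB")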

Ricky-Ting (Feb 07 '24)

For longer sequence inputs (1k/2k/4k, ...), bigdl-llm uses more memory than the official model. We will look into this.

hkvision (Feb 09 '24)

thanks for the update.

juan-OY (Feb 17 '24)

Hi, sorry for the late reply. One difference is that the official INT4 model uses w4a16 (4-bit weights, fp16 activations), whereas previously when you ran with ipex-llm we were using w4a32, so you need to add model = model.half() after loading the model and before moving it to xpu. We have since optimized our memory usage, and compared with the RTX4090 the memory usage is now comparable. Please have a check with the latest ipex-llm :)
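A minimal sketch of the suggested change, assuming the current ipex-llm transformers-style loading API (the model path is a placeholder):

    from ipex_llm.transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-1_8B-Chat",      # placeholder: replace with the local checkpoint path
        load_in_4bit=True,
        trust_remote_code=True,
    )
    model = model.half()            # use fp16 activations (w4a16) instead of the default fp32 (w4a32)
    model = model.to('xpu')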

hkvision (Apr 01 '24)