
qwen1.8B GPU memory usage is too high


https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary

Running the above model on an Arc A770 GPU, I got the following peak GPU memory usage:

    32 in / 32 out:       3.1 GB
    2048 in / 512 out:    7.4 GB
    4096 in / 1024 out:   11.6 GB
    8192 in / 2048 out:   OOM

This is a large gap compared to the INT4 memory usage reported on the official page; can it be optimized?

env: Linux 22.05, kernel 5.19, OneAPI 2024.0, bigdl 2.5.0b20231218, ipex: 2.1.10+xpu

How to reproduce: convert the Qwen model to low-bit INT4, then run the benchmark script with different input lengths to measure performance. Part of the test code:

    with torch.inference_mode():
        torch.xpu.synchronize()
        prompt = QWEN_PROMPT_FORMAT.format(prompt=prompt)
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
        torch.xpu.synchronize()
        # ipex model needs a warmup, then inference time can be accurate
        output = model.generate(input_ids, max_new_tokens=args.n_predict)

        for i in range(5):
            st = time.time()
            torch.xpu.synchronize()
            input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')
            output = model.generate(input_ids, do_sample=False, max_new_tokens=args.n_predict)
            output_str = tokenizer.decode(output[0], skip_special_tokens=True)
            torch.xpu.synchronize()
            end = time.time()
            print(f"cost {end - st:.4f}s")
            print(output_str)
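For reference, the low-bit INT4 conversion mentioned above is done with bigdl-llm's transformers-style loading API. A minimal sketch, assuming the bigdl 2.5.0b nightly listed in the env above; the model path is a placeholder (a local ModelScope download directory can be used instead):

    import torch
    from bigdl.llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    # Load the Qwen checkpoint and quantize its weights to INT4 on the fly.
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-1_8B-Chat",      # placeholder: replace with the local checkpoint path
        load_in_4bit=True,
        trust_remote_code=True,
    )
    model = model.to('xpu')         # move the low-bit model to the Arc GPU

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", trust_remote_code=True)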

juan-OY (Dec 29 '23)

Hi, on that page (https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary) it seems the reported memory is for 1 token in and 2048/8192 tokens out.

We will reproduce this result and update our results here.

hkvision (Jan 02 '24)

We used https://qianwen-res.oss-cn-beijing.aliyuncs.com/profile.py to reproduce the results.

Here are the results on an NVIDIA GPU: the memory usage reported by torch.cuda.max_memory_allocated matches the official report, while the usage reported by nvidia-smi is a little larger.

Model                Device    in-out   torch.cuda.max_memory_allocated   nvidia-smi
Qwen-1_8B-Chat-Int4  RTX4090   1-2048   2.91 GB                           3.62 GB

Here are the results on Intel's A770 using bigdl-llm: the memory usage is a little larger than the official report, but it is reasonable. We are continuing to optimize Qwen's memory footprint.

Model            Device    in-out   torch.xpu.max_memory_allocated    xpu-smi
Qwen-1_8B-Chat   Arc 770   1-2048   3.34 GB                           4.01 GB
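For reference, the peak allocation in the table can be read back in Python right after generation. A minimal sketch, assuming `model` and `input_ids` are already prepared on the xpu device as in the snippet above, and that intel_extension_for_pytorch is installed so the torch.xpu memory API is available (on NVIDIA the analogous call is torch.cuda.max_memory_allocated):

    import torch
    import intel_extension_for_pytorch as ipex  # registers the torch.xpu namespace

    # Clear the peak-memory counter before the measured run.
    torch.xpu.reset_peak_memory_stats()

    output = model.generate(input_ids, max_new_tokens=2048)
    torch.xpu.synchronize()

    peak_bytes = torch.xpu.max_memory_allocated()
    print(f"peak GPU memory allocated: {peak_bytes / 1024**3:.2f} GB")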

Ricky-Ting (Feb 07 '24)

For longer sequence inputs (1k/2k/4k, ...), bigdl-llm uses more memory than the official model. We will look into this.

hkvision (Feb 09 '24)

thanks for the update.

juan-OY (Feb 17 '24)

Hi, sorry for the late reply. One difference is that the official INT4 model uses w4a16 (4-bit weights, fp16 activations), whereas previously when you ran with ipex-llm we were using w4a32, so you need to add model = model.half() after loading the model and before moving it to xpu. We have since optimized our memory usage, and compared with the RTX4090 the memory usage is now comparable. Please have a check with the latest ipex-llm :)
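A minimal sketch of the suggested change, assuming the current ipex-llm transformers-style loading API (the model path is a placeholder):

    from ipex_llm.transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen-1_8B-Chat",      # placeholder: replace with the local checkpoint path
        load_in_4bit=True,
        trust_remote_code=True,
    )
    model = model.half()            # use fp16 activations (w4a16) instead of the default fp32 (w4a32)
    model = model.to('xpu')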

hkvision (Apr 01 '24)