
Is it possible to accelerate llama2 with an NPU?

Open Msyu1020 opened this issue 5 months ago • 3 comments

I'm curious about the memory usage and practical performance of accelerating larger on-device LLMs like 7B models with an NPU. I noticed that mllm supports llama2-7B on the NPU, but I couldn't find any llama2 counterpart to main_qwen_npu.cpp.

Could you tell me how to run llama2-7B the way you tested it in "Fast On-device LLM Inference with NPUs"?

Msyu1020 avatar Jul 10 '25 12:07 Msyu1020
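
For the memory-usage half of the question, a back-of-envelope estimate gives a useful baseline before any NPU specifics. The sketch below uses Llama2-7B's published architecture (roughly 6.7B parameters, 32 layers, hidden size 4096); the context length and quantization widths are illustrative assumptions, not numbers measured with mllm:

```cpp
// Back-of-envelope memory estimate for Llama2-7B on-device inference.
// Architecture constants are the published Llama2-7B ones; context length
// and bit widths below are illustrative assumptions, not mllm measurements.
#include <cstdio>

int main() {
    const double kParams     = 6.74e9; // Llama2-7B parameter count
    const int    kLayers     = 32;
    const int    kHidden     = 4096;
    const int    kContextLen = 2048;   // assumed context length

    // Weight footprint at common bit widths.
    const double w16 = kParams * 2.0 / (1 << 30); // fp16, GiB
    const double w4  = kParams * 0.5 / (1 << 30); // 4-bit (ignoring scales/zeros), GiB

    // KV cache: 2 (K and V) * layers * hidden * 2 bytes (fp16) per token.
    const double kvPerToken = 2.0 * kLayers * kHidden * 2.0;        // bytes
    const double kvTotal    = kvPerToken * kContextLen / (1 << 30); // GiB

    std::printf("weights fp16 : %.1f GiB\n", w16);
    std::printf("weights 4-bit: %.1f GiB\n", w4);
    std::printf("KV cache (%d tokens, fp16): %.2f GiB\n", kContextLen, kvTotal);
    return 0;
}
```

At 4-bit weights the model itself is only a few GiB plus about 1 GiB of fp16 KV cache at a 2048-token context, so an out-of-memory crash on a 24GB device (as reported below) would more likely point at per-backend buffer duplication or graph-build overhead than at the raw weight size.
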

cc @oreomaker @liang1232018

chenghuaWang avatar Jul 22 '25 13:07 chenghuaWang

I completed the NPU inference part for the Llama2-7B model by following the publicly available code. However, when I tested it on a 24GB device, it still crashed due to insufficient memory. You could give it a try as well.

yangyyj avatar Jul 25 '25 09:07 yangyyj
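
Before retrying on another device, it can help to check how much memory is actually free at load time; on Android/Linux the MemAvailable field of /proc/meminfo is a reasonable proxy. A minimal sketch follows (the 16 GiB threshold is an illustrative assumption, not an mllm requirement):

```cpp
// Read MemAvailable from /proc/meminfo (Linux/Android) to gauge headroom
// before attempting to load a 7B model.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

long memAvailableKiB() {
    std::ifstream meminfo("/proc/meminfo");
    std::string line;
    while (std::getline(meminfo, line)) {
        if (line.rfind("MemAvailable:", 0) == 0) {
            std::istringstream iss(line.substr(13)); // strip the field name
            long kib = 0;
            iss >> kib;
            return kib;
        }
    }
    return -1; // field not found
}

int main() {
    // Assumed headroom for weights + KV cache + NPU graph buffers; illustrative only.
    const long needKiB = 16L * 1024 * 1024;
    const long have = memAvailableKiB();
    if (have < 0) {
        std::cerr << "could not read /proc/meminfo\n";
        return 1;
    }
    std::cout << "MemAvailable: " << have / 1024 << " MiB\n";
    if (have < needKiB) {
        std::cout << "likely too little free memory for Llama2-7B NPU inference\n";
    }
    return 0;
}
```
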

Ok, thank you!

Msyu1020 avatar Jul 26 '25 11:07 Msyu1020