Is it possible to accelerate Llama2 with an NPU?
I'm curious about the memory usage and practical performance of accelerating larger on-device LLMs like 7B with an NPU. I noticed that mllm supports Llama2-7B on the NPU, but I couldn't find any Llama2 counterpart to files like main_qwen_npu.cpp.
Could you tell me how to run Llama2-7B the way it was tested in Fast On-device LLM Inference with NPUs?
cc @oreomaker @liang1232018
I completed the NPU inference part of the Llama2-7B model based on the publicly available code. However, when I tested it on a device with 24 GB of memory, it still crashed due to insufficient memory. You can also give it a try.
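For context on the memory question, here is a rough back-of-envelope sketch of the weight and KV-cache footprint of Llama2-7B at a few common bit widths. It is not based on mllm's internals; the quantization widths and the 2048-token context are assumptions for illustration only, and actual peak usage (e.g. during model loading or with separate CPU/NPU buffers) can be noticeably higher than these steady-state numbers.

```cpp
// Back-of-envelope memory estimate for Llama2-7B on-device inference.
// Parameter count and shapes are the published Llama2-7B numbers; the bit
// widths and context length are illustrative assumptions, not mllm's
// actual NPU configuration.
#include <cstdio>

int main() {
    const double params      = 6.74e9;  // Llama2-7B parameter count
    const int    layers      = 32;      // transformer layers
    const int    hidden      = 4096;    // hidden size
    const int    context_len = 2048;    // assumed prompt + generation length

    // Weight footprint at different bit widths (bytes).
    const double weights_fp16 = params * 2.0;
    const double weights_int8 = params * 1.0;
    const double weights_int4 = params * 0.5;

    // KV cache in FP16: K and V tensors per layer, one hidden-sized vector
    // of 2-byte values per token each.
    const double kv_cache = 2.0 * layers * hidden * 2.0 * context_len;

    const double GiB = 1024.0 * 1024.0 * 1024.0;
    std::printf("weights fp16: %.1f GiB\n", weights_fp16 / GiB);
    std::printf("weights int8: %.1f GiB\n", weights_int8 / GiB);
    std::printf("weights int4: %.1f GiB\n", weights_int4 / GiB);
    std::printf("kv cache (fp16, %d tokens): %.1f GiB\n",
                context_len, kv_cache / GiB);
    return 0;
}
```

This prints roughly 12.6 GiB for FP16 weights, 6.3 GiB for INT8, 3.1 GiB for INT4, and about 1 GiB of FP16 KV cache at 2048 tokens, which suggests the crash on a 24 GB device is more likely caused by loading/runtime overhead than by the quantized weights alone.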
Ok, thank you!