Is it possible to accelerate Llama2 with an NPU?
I'm curious about the memory usage and practical performance of accelerating larger on-device LLMs like 7B with an NPU. I noticed that mllm supports Llama2-7B on the NPU, but I couldn't find any Llama2 counterpart to files like main_qwen_npu.cpp.
Could you tell me how to run Llama2-7B the way it was tested in Fast On-device LLM Inference with NPUs?
cc @oreomaker @liang1232018
I completed the NPU inference part of the Llama2-7B model based on the publicly available code. However, when I tested it on a device with 24 GB of memory, it still crashed due to insufficient memory. You can also give it a try.
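For context on the memory question, here is a rough back-of-envelope sketch of the weight and KV-cache footprint of Llama2-7B at a few common bit widths. It is not based on mllm's internals; the quantization widths and the 2048-token context are assumptions for illustration only, and actual peak usage (e.g. during model loading or with separate CPU/NPU buffers) can be noticeably higher than these steady-state numbers.

```cpp
// Back-of-envelope memory estimate for Llama2-7B on-device inference.
// Parameter count and shapes are the published Llama2-7B numbers; the bit
// widths and context length are illustrative assumptions, not mllm's
// actual NPU configuration.
#include <cstdio>

int main() {
    const double params      = 6.74e9;  // Llama2-7B parameter count
    const int    layers      = 32;      // transformer layers
    const int    hidden      = 4096;    // hidden size
    const int    context_len = 2048;    // assumed prompt + generation length

    // Weight footprint at different bit widths (bytes).
    const double weights_fp16 = params * 2.0;
    const double weights_int8 = params * 1.0;
    const double weights_int4 = params * 0.5;

    // KV cache in FP16: K and V tensors per layer, one hidden-sized vector
    // of 2-byte values per token each.
    const double kv_cache = 2.0 * layers * hidden * 2.0 * context_len;

    const double GiB = 1024.0 * 1024.0 * 1024.0;
    std::printf("weights fp16: %.1f GiB\n", weights_fp16 / GiB);
    std::printf("weights int8: %.1f GiB\n", weights_int8 / GiB);
    std::printf("weights int4: %.1f GiB\n", weights_int4 / GiB);
    std::printf("kv cache (fp16, %d tokens): %.1f GiB\n",
                context_len, kv_cache / GiB);
    return 0;
}
```

This prints roughly 12.6 GiB for FP16 weights, 6.3 GiB for INT8, 3.1 GiB for INT4, and about 1 GiB of FP16 KV cache at 2048 tokens, which suggests the crash on a 24 GB device is more likely caused by loading/runtime overhead than by the quantized weights alone.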
Ok, thank you!