
[Feature Request] run the LLM model on the Qualcomm Hexagon NPU in Android OS

Open taeyeonlee opened this issue 1 year ago • 17 comments

🚀 Feature

Hello, is it possible to run an LLM (quantized Llama 2 7B) on the Qualcomm Hexagon NPU under Android OS? If so, how can it be done?

Motivation

Qualcomm claims that the Hexagon NPU delivers up to 98% faster performance.

Alternatives

Additional context

taeyeonlee avatar Jan 31 '24 02:01 taeyeonlee

I tried a bit but failed, since Hexagon is not open to developers. To be specific:

  1. It runs a 32-bit RTOS with a 4 GB memory limit (Qualcomm can use tricks to support more memory, but we cannot) — see the rough numbers sketched below.
  2. There is no public HMX API or documentation.
  3. There are no optimization docs for HVX.

We can leave this issue open, but it would be hard to support.
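
To put point 1 in perspective, here is a rough back-of-the-envelope footprint calculation. The specific numbers are my assumptions (4-bit weight quantization, an fp16 KV cache at a 2k context, guessed runtime overhead), not measurements:

```python
# Rough footprint of quantized Llama 2 7B vs. a 32-bit (4 GB) address space.
params = 7e9
weight_gb = params * 0.5 / 2**30     # ~3.3 GB of 4-bit weights
kv_cache_gb = 1.0                    # fp16 KV cache at a 2k context (rough estimate)
runtime_gb = 0.3                     # activations, code, allocator slack (guess)
total_gb = weight_gb + kv_cache_gb + runtime_gb
print(f"~{total_gb:.1f} GB needed vs. ~4 GB addressable")  # already past the limit
```

Even with optimistic assumptions, the weights alone nearly fill the 32-bit address space before any KV cache or runtime state is accounted for.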

Hzfengsy avatar Jan 31 '24 06:01 Hzfengsy

@Hzfengsy I'll ask Qualcomm for the info.

taeyeonlee avatar Feb 01 '24 04:02 taeyeonlee

@Hzfengsy @taeyeonlee Would you consider indirect support through the Android NNAPI instead of low-level API support? NNAPI switches automatically between CPU, GPU, and NPU, although many optimization methods may not be usable. At present, mobile device resources are limited, and full utilization needs to be considered (e.g. running multiple models on the phone at once, such as LLM, ASR, TTS, etc.).

ningpengtao-coder avatar Feb 04 '24 02:02 ningpengtao-coder

@ningpengtao-coder Thanks for your suggestion. That's a good approach to running models on Android. However, I (as well as the team) do not have extra bandwidth to support NNAPI in TVM and MLC-LLM. Love to see the power of community in this interesting area :)

Hzfengsy avatar Feb 04 '24 15:02 Hzfengsy

@Hzfengsy I'm a little bit confused, because TVM does have a Hexagon backend for codegen, and mlc-llm is based on TVM Unity. So why can't mlc-llm lower to Hexagon target code? Is anything unsupported along the path "Relax --> TIR --> Hexagon target code"? (A rough illustration of that lowering path is sketched below.)
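
For reference, a minimal, untested sketch of what that lowering looks like with TVM's Hexagon target. It assumes TVM is built with Hexagon support (USE_HEXAGON=ON) and the Hexagon SDK toolchain is available; exact APIs and the CPU version string ("v68" here) may differ between TVM versions and SoCs:

```python
import tvm
from tvm import te

# A toy elementwise kernel, just to show the TIR -> Hexagon codegen step.
n = 1024
A = te.placeholder((n,), dtype="float16", name="A")
B = te.placeholder((n,), dtype="float16", name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

prim_func = te.create_prim_func([A, B, C])

# Hexagon target; the CPU version (v68/v69/v73, ...) depends on the SoC generation.
target = tvm.target.hexagon("v68")

# Codegen itself usually succeeds; actually running the result needs a Hexagon
# device session (tvm.contrib.hexagon) and signed binaries, which is the hard part.
lib = tvm.build(prim_func, target=target)
```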

FdyCN avatar Feb 05 '24 02:02 FdyCN

@FdyCN The problem seems to be that the HTP backend has many limitations, including how much memory can be requested and how fast that memory is. Qualcomm has claimed in some videos that it can make a 7B model reach 20 tok/s, but I don't know how they achieved it: in my attempts to run a single transformer layer on the QNN HTP backend, the time exceeded 100 ms. Reaching 20 tok/s on HTP would be a good result on a phone, because it would free up the GPU.
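
A quick sanity check on those numbers, assuming Llama 2 7B's 32 decoder layers and that "20 tok/s" means roughly 50 ms per generated token end-to-end:

```python
layers = 32                       # decoder layers in Llama 2 7B
ms_per_layer = 100                # measured single-layer time on the QNN HTP backend
ms_per_token_measured = layers * ms_per_layer   # ~3200 ms/token at this per-layer rate
ms_per_token_claimed = 1000 / 20                # 50 ms/token at the claimed 20 tok/s
print(ms_per_token_measured, ms_per_token_claimed)  # 3200 vs 50: a ~64x gap
```

So either the single-layer measurement carries large fixed overheads, or Qualcomm's pipeline works very differently from a naive per-layer dispatch.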

shifeiwen avatar Feb 05 '24 09:02 shifeiwen

I have tried running a 1.1B Llama on the Hexagon backend before, and it was very slow. I did not use CPU scheduling and only added HVX compilation flags when invoking LLVM, but I don't think those flags actually had any effect on codegen (a way to check this is sketched below).
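
One way to check whether HVX actually made it into the generated code is to look for HVX vector memory instructions (vmem) in the disassembly. A rough, untested sketch, assuming the module was built with TVM's LLVM-based Hexagon backend and that get_source("asm") is available for it:

```python
def count_hvx_instructions(lib):
    """Count vmem (HVX vector load/store) instructions in the generated assembly.

    'lib' is the module returned by tvm.build(..., target=tvm.target.hexagon(...)).
    """
    asm = lib.get_source("asm")
    return sum("vmem" in line for line in asm.splitlines())

# A count of zero strongly suggests the kernel was scalarized and HVX was never used.
# print(count_hvx_instructions(lib))
```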

shifeiwen avatar Feb 05 '24 09:02 shifeiwen

@shifeiwen Thank you so much for the reply. I haven't tested TVM's Hexagon codegen performance myself. According to your experiment, it seems HVX auto-tuning cannot produce high-performance kernels, so mlc-llm on an HVX-only backend can work but is slow. Am I right?

FdyCN avatar Feb 05 '24 11:02 FdyCN

This is something we would ideally like to enable, and indeed we need to overcome some of the hurdles mentioned. We can keep this issue open to track the status; getting things into a runnable state is a good first step.

tqchen avatar Feb 05 '24 15:02 tqchen

@FdyCN Yes, there are currently ways to get MLC running on the Hexagon backend, but in my tests it was very slow: each token of 1.1B Llama takes more than 60 s. (There is a lot of optimization work I did not do, such as better CPU scheduling or truly exploiting HVX features.) PS: loading the 1.1B TinyLlama model takes 10 minutes, and the DSP's memory is very slow. I wanted to use some shared-memory approaches, but that was never completed. A rough bandwidth estimate is below.
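
To put "60 s per token" in perspective, here is a back-of-the-envelope estimate of the effective weight-streaming bandwidth. The 4-bit quantization level is my assumption (not confirmed above), and it treats decoding as reading all weights once per token:

```python
params = 1.1e9                     # TinyLlama-1.1B
bytes_per_param = 0.5              # 4-bit quantization (assumed)
bytes_per_token = params * bytes_per_param      # ~0.55 GB of weights read per token
seconds_per_token = 60
effective_mb_s = bytes_per_token / seconds_per_token / 1e6
print(f"~{effective_mb_s:.0f} MB/s effective bandwidth")  # ~9 MB/s, far below DRAM speed
```

Under these assumptions, the bottleneck is clearly not compute but how slowly the DSP can stream the weights.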

shifeiwen avatar Feb 06 '24 03:02 shifeiwen

@shifeiwen Thank you for the reply; your test results are really helpful. I think deploying an LLM on HVX through TVM is probably not the best choice for now.
Could you please share your optimization tests later? Really appreciated!

FdyCN avatar Feb 06 '24 04:02 FdyCN

Has the situation around public APIs and docs changed at all? What about the Neural Processing SDK / AI Engine Direct SDK (they are actually the same SDK download now)?

https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk and https://developer.qualcomm.com/software/qualcomm-ai-engine-direct-sdk

hmartinez82 avatar Apr 28 '24 23:04 hmartinez82

wish+1008611

MrRace avatar May 24 '24 11:05 MrRace

Anyone interested in working on this should look at https://developer.qualcomm.com/downloads/halide-hvx-training?referrer=node/6116

pro9code avatar Jun 02 '24 13:06 pro9code

This is a good request. If anyone has a better way to use the NPU for LLM inference, please share some ideas.

Yemaoxin avatar Jul 18 '24 12:07 Yemaoxin

In case this is of interest, we provide an example for deploying TinyLlaMA-1.1B-Chat on Qualcomm Hexagon NPU (SM8650): https://github.com/saic-fi/MobileQuant/tree/main/capp. However, our solution is pretty ad-hoc compared to MLC-LLM.

fwtan avatar Sep 04 '24 14:09 fwtan