[Feature Request] Do you have any plan to support CPU backend on Android devices?

Open xukui1203 opened this issue 1 year ago • 13 comments

🚀 Feature

I know there is an OpenCL backend on the Android platform. But on many Android devices, the GPU is already used by other subsystems such as the display. So we need to use the CPU to run LLMs.

xukui1203 avatar Oct 23 '23 02:10 xukui1203

As of now our focus has been on GPU and possibly NPU.

CPU can in theory be supported, as TVM has CPU backends, so we also welcome contributions in that direction.

tqchen avatar Oct 24 '23 13:10 tqchen
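For context, a minimal sketch of the contribution direction described above, assuming TVM's standard Relay build flow; the toy dense layer, the target triple, and the `model_cpu.so` name are illustrative placeholders, not the mlc-llm build path.

```python
# Compile a toy Relay graph for a 64-bit Android CPU via TVM's LLVM backend.
import tvm
from tvm import relay
from tvm.contrib import ndk  # wraps the Android NDK toolchain for linking

# Toy graph: a single dense layer standing in for a real model.
data = relay.var("data", shape=(1, 128), dtype="float32")
weight = relay.var("weight", shape=(64, 128), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.dense(data, weight))

# LLVM CPU target for aarch64 Android.
target = tvm.target.Target("llvm -mtriple=aarch64-linux-android")
lib = relay.build(mod, target=target)

# Cross-compile the shared library with the NDK (expects TVM_NDK_CC to be set).
lib.export_library("model_cpu.so", fcompile=ndk.create_shared)
```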

Thanks for your reply. I tried it, and the CPU backend does work on the Android platform, but it is very slow. We are now trying the Arm Compute Library (ACL) to see whether it gives better performance.

xukui1203 avatar Oct 25 '23 03:10 xukui1203

What perf are you getting on the TVM CPU and TVM GPU backends? If your Arm Compute Library implementation is ready, can you please share its perf as well?

Nick-infinity avatar Oct 28 '23 15:10 Nick-infinity

For Vicuna-7B, it is about 8 tokens/s on the GPU, but the CPU needs 50 s to decode 1 token (about 0.02 tokens/s, roughly 400× slower). We still cannot bring up the Arm Compute Library.

xukui1203 avatar Oct 30 '23 01:10 xukui1203

This is a great question indeed. Also, thanks for this wonderful repo.

Do you know how I can choose which accelerator is used, @tqchen? I tried to follow the code, but could not see where the NPU or GPU is selected as the accelerator.

FabianSchuetze avatar Oct 30 '23 16:10 FabianSchuetze

We are using GPUs on Android. CPUs, as indicated in this thread, are likely too slow to support an LLM meaningfully.

junrushao avatar Oct 30 '23 17:10 junrushao
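For reference on the selection question above: at the TVM runtime level the accelerator is chosen when a compiled module is bound to a device. A minimal sketch, assuming a hypothetical compiled artifact `model.so` and the standard graph executor (mlc-llm's actual Android runtime wraps this differently):

```python
# Pick an accelerator at runtime: prefer OpenCL (GPU), fall back to CPU.
import tvm
from tvm.contrib import graph_executor

dev = tvm.opencl(0) if tvm.opencl(0).exist else tvm.cpu(0)

# Load the compiled module and bind it to the chosen device.
lib = tvm.runtime.load_module("model.so")  # hypothetical artifact name
gmod = graph_executor.GraphModule(lib["default"](dev))
```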

Thanks, @junrushao, for the comment. What about accelerators other than GPUs? NPUs or DSPs come to mind.

FabianSchuetze avatar Nov 05 '23 17:11 FabianSchuetze

We once used the Hexagon DSP backend to test TinyLlama-1.1B (q4f16_0). It is too slow: it needs 20 seconds to decode 1 token. There is an HTP on the Qualcomm DSP, but it does not support fp16; I think if we could use the HTP, it might give better performance.

xukui1203 avatar Nov 06 '23 07:11 xukui1203
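As an aside, a sketch of how these backends are expressed as TVM targets; the Hexagon architecture version is an illustrative assumption, and actually building for and running on the DSP additionally requires the Hexagon SDK and TVM's Hexagon runtime.

```python
# Construct TVM targets for the backends discussed in this thread.
import tvm

cpu = tvm.target.Target("llvm -mtriple=aarch64-linux-android")
gpu = tvm.target.Target("opencl", host=cpu)
dsp = tvm.target.hexagon("v68")  # architecture version is illustrative
print(cpu, gpu, dsp, sep="\n")
```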

Was the Hexagon DSP test you mentioned done with MLC-LLM or with Qualcomm tools (such as QNN/SNPE)? I am also very interested in your test on the Arm CPU. Could you share more details, such as test code, technical articles, etc.?

liangzelang avatar Apr 23 '24 06:04 liangzelang

We used MLC-LLM to run the LLM on the Hexagon DSP. The performance is better now. It is easy to run an LLM on the CPU, but I am sorry, I did not save the patch after the test.

xukui1203 avatar Apr 23 '24 09:04 xukui1203

ok, thx.

liangzelang avatar Apr 25 '24 06:04 liangzelang

I am also very interested in your research on CPU/DSP backends. Could you share test code, technical articles, etc.?

junwenZhang avatar Jul 18 '24 02:07 junwenZhang

MLLM used the Hexagon NPU to achieve 1000 tokens/s prefill: https://github.com/UbiquitousLearning/mllm. Maybe it can work.

Yemaoxin avatar Jul 18 '24 13:07 Yemaoxin