[Feature Request] Do you have any plan to support CPU backend on Android devices?
🚀 Feature
I know there is an OpenCL backend on the Android platform, but on many Android devices the GPU is already occupied by other subsystems such as the display, so we need to use the CPU to run the LLM.
As of now our focus has been on GPU and possibly NPU.
CPUs can in theory be supported, since TVM has CPU backends, so we also welcome contributions exploring that direction.
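For context, a minimal sketch of what that direction could look like: cross-compiling a toy TVM kernel for an aarch64 Android CPU through TVM's LLVM backend. The kernel and target triple are illustrative assumptions, not MLC's actual compilation flow.

```python
# Minimal sketch (illustrative, not MLC's build pipeline): cross-compiling a
# toy TVM kernel for an aarch64 Android CPU via the LLVM backend. Requires
# the Android NDK, with TVM_NDK_CC pointing at the NDK clang.
import tvm
from tvm import te
from tvm.contrib import ndk

# A trivial elementwise op standing in for a real LLM operator.
n = te.var("n")
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

target = tvm.target.Target("llvm -mtriple=aarch64-linux-android")
mod = tvm.build(te.create_prim_func([A, B]), target=target)

# Cross-link with the NDK toolchain; the .so can then be pushed to the
# device and loaded with tvm.runtime.load_module in an on-device runtime.
mod.export_library("cpu_kernel.so", fcompile=ndk.create_shared)
```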
Thanks for your reply. I tried it: the CPU backends do work on the Android platform, but they are really very slow. We are now trying the Arm Compute Library (ACL) to see whether it gives better performance.
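For anyone following the ACL direction, a hedged sketch based on TVM's "Relay Arm Compute Library Integration" docs: supported Relay ops are partitioned out and offloaded to ACL via BYOC. A TVM build with USE_ARM_COMPUTE_LIB=ON is assumed, and the toy graph is an illustration, not the thread author's actual setup.

```python
# Hedged sketch of TVM's Arm Compute Library BYOC flow; assumes a Relay-era
# TVM built with USE_ARM_COMPUTE_LIB=ON.
import tvm
from tvm import relay
from tvm.relay.op.contrib.arm_compute_lib import partition_for_arm_compute_lib

# A toy dense layer standing in for an LLM matmul.
data = relay.var("data", shape=(1, 512), dtype="float32")
weight = relay.var("weight", shape=(512, 512), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.dense(data, weight))

# Offload ACL-supported ops (nn.dense here); everything else stays on the
# regular CPU path.
mod = partition_for_arm_compute_lib(mod)
lib = relay.build(mod, target="llvm -mtriple=aarch64-linux-android")
```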
What performance are you getting on the TVM CPU and TVM GPU backends? If your Arm Compute Library implementation is ready, could you please share its performance as well?
For Vicuna-7B, it is about 8 tokens/s on GPU, but the CPU needs 50 seconds to decode one token (about 0.02 tokens/s, a roughly 400x gap). We still cannot bring up the Arm Compute Library.
This is a great question indeed. Also, thanks for this wonderful repo.
Do you know how I can choose which accelerator is used, @tqchen? I tried to follow the code, but could not see where the NPU or GPU is selected as the accelerator.
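Not an authoritative answer, but at the TVM runtime layer underneath MLC, the accelerator is chosen by the device handle passed when allocating tensors and loading the compiled module. A minimal sketch (device indices are illustrative):

```python
# Minimal sketch: device selection in the TVM runtime, the layer MLC builds on.
import numpy as np
import tvm

dev = tvm.opencl(0)  # Android GPU via OpenCL; tvm.cpu(0) would target the CPU
x = tvm.nd.array(np.zeros((4,), dtype="float32"), device=dev)
print(x.device)
```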
We are using GPUs on Android. CPUs, as indicated in this thread, are likely too slow to support an LLM meaningfully.
Thanks, @junrushao, for the comment. What about accelerators other than GPUs? NPUs or DSPs come to mind.
We once used the Hexagon DSP backend to test TinyLlama-1.1B (q4f16_0). It is too slow, needing 20 seconds to decode one token. There is an HTP on the Qualcomm DSP, but it does not support fp16; I think that if we could use the HTP, it might give better performance.
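For reference, a minimal sketch of how a Hexagon target is declared in TVM, the backend presumably used for the DSP test above. The HVX version ("v69") is an assumption, and actually running on device additionally needs TVM's Hexagon runtime and launcher, omitted here.

```python
# Hedged sketch: declaring a TVM Hexagon target with an Android host.
# The "v69" architecture version is an assumption for illustration.
import tvm

target = tvm.target.Target(
    tvm.target.hexagon("v69"),
    host=tvm.target.Target("llvm -mtriple=aarch64-linux-android"),
)
print(target)
```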
Was the Hexagon DSP test you mentioned run with MLC-LLM or with Qualcomm tools (like QNN/SNPE)? I am also very interested in your test on the Arm CPU. Could you share more details, such as test code, technical articles, etc.?
We used MLC-LLM to run the LLM on the Hexagon DSP, and the performance is better now. It is easy to run an LLM on the CPU, but I am sorry I did not save the patch after the test.
ok, thx.
I am very interested in your research on the CPU/DSP. Could you share test code, technical articles, etc.?
MLLM used the Hexagon NPU to achieve 1000 tokens/s prefill: https://github.com/UbiquitousLearning/mllm. Maybe that approach could work here.