[Feature Request] Do you have any plan to support CPU backend on Android devices?
🚀 Feature
I know there is an OpenCL backend on the Android platform, but on many Android devices the GPU is already occupied by other subsystems such as the display, so we need to use the CPU to run the LLM.
As of now our focus has been on GPU and possibly NPU.
CPUs can in theory be supported, since TVM has CPU backends, so we also welcome contributions exploring that direction.
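For context, a minimal sketch of what that direction could look like: cross-compiling a toy TVM kernel for an aarch64 Android CPU through TVM's LLVM backend. The kernel and target triple are illustrative assumptions, not MLC's actual compilation flow.

```python
# Minimal sketch (illustrative, not MLC's build pipeline): cross-compiling a
# toy TVM kernel for an aarch64 Android CPU via the LLVM backend. Requires
# the Android NDK, with TVM_NDK_CC pointing at the NDK clang.
import tvm
from tvm import te
from tvm.contrib import ndk

# A trivial elementwise op standing in for a real LLM operator.
n = te.var("n")
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

target = tvm.target.Target("llvm -mtriple=aarch64-linux-android")
mod = tvm.build(te.create_prim_func([A, B]), target=target)

# Cross-link with the NDK toolchain; the .so can then be pushed to the
# device and loaded with tvm.runtime.load_module in an on-device runtime.
mod.export_library("cpu_kernel.so", fcompile=ndk.create_shared)
```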
Thanks for your reply. I tried it: the CPU backends do work on the Android platform, but they are really very slow. We are now trying the Arm Compute Library (ACL) to see whether it gives better performance.
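For anyone following the ACL direction, a hedged sketch based on TVM's "Relay Arm Compute Library Integration" docs: supported Relay ops are partitioned out and offloaded to ACL via BYOC. A TVM build with USE_ARM_COMPUTE_LIB=ON is assumed, and the toy graph is an illustration, not the thread author's actual setup.

```python
# Hedged sketch of TVM's Arm Compute Library BYOC flow; assumes a Relay-era
# TVM built with USE_ARM_COMPUTE_LIB=ON.
import tvm
from tvm import relay
from tvm.relay.op.contrib.arm_compute_lib import partition_for_arm_compute_lib

# A toy dense layer standing in for an LLM matmul.
data = relay.var("data", shape=(1, 512), dtype="float32")
weight = relay.var("weight", shape=(512, 512), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.dense(data, weight))

# Offload ACL-supported ops (nn.dense here); everything else stays on the
# regular CPU path.
mod = partition_for_arm_compute_lib(mod)
lib = relay.build(mod, target="llvm -mtriple=aarch64-linux-android")
```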
What performance are you getting on the TVM CPU and TVM GPU backends? If your Arm Compute Library implementation is ready, could you please share its performance as well?
For Vicuna-7B, it is about 8 tokens/s on GPU, but the CPU needs 50 seconds to decode one token (about 0.02 tokens/s, a roughly 400x gap). We still cannot bring up the Arm Compute Library.
This is a great question indeed. Also, thanks for this wonderful repo.
Do you know how I can choose which accelerator is used, @tqchen? I tried to follow the code, but could not see where the NPU or GPU is selected as the accelerator.
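Not an authoritative answer, but at the TVM runtime layer underneath MLC, the accelerator is chosen by the device handle passed when allocating tensors and loading the compiled module. A minimal sketch (device indices are illustrative):

```python
# Minimal sketch: device selection in the TVM runtime, the layer MLC builds on.
import numpy as np
import tvm

dev = tvm.opencl(0)  # Android GPU via OpenCL; tvm.cpu(0) would target the CPU
x = tvm.nd.array(np.zeros((4,), dtype="float32"), device=dev)
print(x.device)
```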
We are using GPUs on Android. CPUs, as indicated in this thread, are likely too slow to support an LLM meaningfully.
Thanks, @junrushao, for the comment. What about accelerators other than GPUs? NPUs or DSPs come to mind.
We once used the Hexagon DSP backend to test TinyLlama-1.1B (q4f16_0). It is too slow, needing 20 seconds to decode one token. There is an HTP on the Qualcomm DSP, but it does not support fp16; I think that if we could use the HTP, it might give better performance.
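For reference, a minimal sketch of how a Hexagon target is declared in TVM, the backend presumably used for the DSP test above. The HVX version ("v69") is an assumption, and actually running on device additionally needs TVM's Hexagon runtime and launcher, omitted here.

```python
# Hedged sketch: declaring a TVM Hexagon target with an Android host.
# The "v69" architecture version is an assumption for illustration.
import tvm

target = tvm.target.Target(
    tvm.target.hexagon("v69"),
    host=tvm.target.Target("llvm -mtriple=aarch64-linux-android"),
)
print(target)
```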
Was the Hexagon DSP test you mentioned run with MLC-LLM or with Qualcomm tools (like QNN/SNPE)? I am also very interested in your test on the Arm CPU. Could you share more details, such as test code, technical articles, etc.?
We used MLC-LLM to run the LLM on the Hexagon DSP, and the performance is better now. It is easy to run an LLM on the CPU, but I am sorry I did not save the patch after the test.
ok, thx.
I am very interested in your research on the CPU/DSP. Could you share test code, technical articles, etc.?
MLLM used the Hexagon NPU to achieve 1000 tokens/s prefill: https://github.com/UbiquitousLearning/mllm. Maybe that approach could work here.