
Does llama2 example on Android utilize HTP?

Open · CHNtentes opened this issue 1 year ago · 7 comments

https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md#performance

AFAIK, Qualcomm SoCs have an HTP to boost AI performance. Are there performance numbers using the HTP, or just the GPU?

CHNtentes avatar May 11 '24 08:05 CHNtentes

We can run a smaller version of the Llama architecture on HTP, and support for the 7B and 8B models on HTP is a work in progress. cc @cccclai. Running Llama on mobile GPU is also a work in progress. cc @SS-JIA

iseeyuan avatar May 13 '24 14:05 iseeyuan

> We can run a smaller version of the Llama architecture on HTP, and support for the 7B and 8B models on HTP is a work in progress. cc @cccclai. Running Llama on mobile GPU is also a work in progress. cc @SS-JIA

I'm a little confused. If HTP and GPU support are both incomplete right now, what is executorch using, just the CPU?

CHNtentes avatar May 14 '24 01:05 CHNtentes

The performance numbers shared in https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md#performance are CPU-only, using the XNNPACK backend.

Lowering to the Qualcomm HTP is still in progress; right now we only have enablement for the small "stories" models. We are actively working on enabling llama2 and improving the performance numbers.

cccclai avatar May 14 '24 05:05 cccclai
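For readers looking for the XNNPACK (CPU) path mentioned above, here is a minimal sketch of the delegation flow. It uses a toy module rather than llama2, and the module paths (`executorch.exir.to_edge`, `XnnpackPartitioner`) reflect the API around this time, so treat them as assumptions and check the current docs:

```python
# Hedged sketch: lowering a model to the XNNPACK (CPU) backend with
# ExecuTorch. Module paths may have moved between versions.
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
example_inputs = (torch.randn(1, 64),)

# Export, convert to the Edge dialect, then delegate supported
# subgraphs to XNNPACK; unsupported ops stay on the portable CPU kernels.
exported = export(model, example_inputs)
edge = to_edge(exported)
edge = edge.to_backend(XnnpackPartitioner())
exec_prog = edge.to_executorch()

with open("model_xnnpack.pte", "wb") as f:
    f.write(exec_prog.buffer)
```

Ops the partitioner claims are compiled into an XNNPACK delegate payload; everything else falls back to the portable CPU kernels.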

Well, thanks for your reply. It actually makes sense, since Qualcomm themselves do not provide an open way to run LLMs on HTP.

CHNtentes avatar May 14 '24 06:05 CHNtentes

@CHNtentes Are you looking to run the llama2 model via HTP or GPU?

cccclai avatar May 14 '24 16:05 cccclai

> We can run a smaller version of the Llama architecture on HTP, and support for the 7B and 8B models on HTP is a work in progress. cc @cccclai. Running Llama on mobile GPU is also a work in progress. cc @SS-JIA

> I'm a little confused. If HTP and GPU support are both incomplete right now, what is executorch using, just the CPU?

You can just follow the steps to execute llama2 on HTP. But the whole llama2 graph does not run on HTP; any ops or subgraphs that are not supported on HTP fall back to running on the CPU.

czy2014hust avatar May 15 '24 02:05 czy2014hust
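To make the fallback behavior above concrete: after partitioning, backend-supported subgraphs are replaced by delegate call nodes, and whatever remains runs on the portable CPU ops. A small hedged helper (assuming the `EdgeProgramManager` API from the sketch above; a Qualcomm/QNN partitioner would be used in place of the XNNPACK one) can show how much of the graph was actually delegated:

```python
# Hedged sketch: count how many nodes in a lowered edge program were
# delegated to a backend versus left on CPU. Assumes an
# EdgeProgramManager as produced by to_edge(...).to_backend(...)
# in the earlier sketch; API details may differ across versions.
def count_delegated_nodes(edge_program_manager) -> None:
    gm = edge_program_manager.exported_program().graph_module
    delegated = 0
    total = 0
    for node in gm.graph.nodes:
        if node.op != "call_function":
            continue
        total += 1
        # Delegated subgraphs show up as executorch_call_delegate calls;
        # everything else will run on the portable CPU kernels.
        if "executorch_call_delegate" in str(node.target):
            delegated += 1
    print(f"{delegated} delegate call(s) out of {total} call_function nodes")

# Usage, continuing the earlier XNNPACK sketch:
# count_delegated_nodes(edge)
```

If the delegate count is low relative to the total, most of the model is still executing on CPU, which matches the partial-delegation behavior described above.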

> @CHNtentes Are you looking to run the llama2 model via HTP or GPU?

It would be better if I could run Llama on HTP. I suppose the performance and power consumption are better than on CPU/GPU.

CHNtentes avatar May 15 '24 07:05 CHNtentes

In case it's of interest, we provide an example of deploying TinyLlaMA-1.1B-Chat on HTP (SM8650): https://github.com/saic-fi/MobileQuant/tree/main/capp. However, the solution is pretty ad hoc compared to executorch.

fwtan avatar Aug 28 '24 00:08 fwtan