
Does llama2 example on Android utilize HTP?

Open · CHNtentes opened this issue 1 year ago · 7 comments

https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md#performance

AFAIK, Qualcomm SoCs have an HTP to boost AI performance. Are there performance numbers using the HTP, or just the GPU?

CHNtentes avatar May 11 '24 08:05 CHNtentes

We can run a smaller version of the Llama architecture on HTP, and support for the 7B and 8B models on HTP is a work in progress. cc @cccclai. Running Llama on mobile GPU is also a work in progress. cc @SS-JIA

iseeyuan avatar May 13 '24 14:05 iseeyuan

> We can run a smaller version of the Llama architecture on HTP, and support for the 7B and 8B models on HTP is a work in progress. cc @cccclai. Running Llama on mobile GPU is also a work in progress. cc @SS-JIA

I'm a little confused. If HTP and GPU support are both incomplete right now, what is executorch using, just the CPU?

CHNtentes avatar May 14 '24 01:05 CHNtentes

The performance numbers shared in https://github.com/pytorch/executorch/blob/main/examples/models/llama2/README.md#performance are CPU-only, using the XNNPACK backend.

Lowering to the Qualcomm HTP is still in progress; right now we only have enablement for the small "stories" models. We are actively working on enabling llama2 and improving the performance numbers.

cccclai avatar May 14 '24 05:05 cccclai
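For readers looking for the XNNPACK (CPU) path mentioned above, here is a minimal sketch of the delegation flow. It uses a toy module rather than llama2, and the module paths (`executorch.exir.to_edge`, `XnnpackPartitioner`) reflect the API around this time, so treat them as assumptions and check the current docs:

```python
# Hedged sketch: lowering a model to the XNNPACK (CPU) backend with
# ExecuTorch. Module paths may have moved between versions.
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
example_inputs = (torch.randn(1, 64),)

# Export, convert to the Edge dialect, then delegate supported
# subgraphs to XNNPACK; unsupported ops stay on the portable CPU kernels.
exported = export(model, example_inputs)
edge = to_edge(exported)
edge = edge.to_backend(XnnpackPartitioner())
exec_prog = edge.to_executorch()

with open("model_xnnpack.pte", "wb") as f:
    f.write(exec_prog.buffer)
```

Ops the partitioner claims are compiled into an XNNPACK delegate payload; everything else falls back to the portable CPU kernels.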

Well, thanks for your reply. It actually makes sense, since Qualcomm themselves do not provide an open way to run LLMs on HTP.

CHNtentes avatar May 14 '24 06:05 CHNtentes

@CHNtentes Are you looking to run the llama2 model via HTP or GPU?

cccclai avatar May 14 '24 16:05 cccclai

> We can run a smaller version of the Llama architecture on HTP, and support for the 7B and 8B models on HTP is a work in progress. cc @cccclai. Running Llama on mobile GPU is also a work in progress. cc @SS-JIA

> I'm a little confused. If HTP and GPU support are both incomplete right now, what is executorch using, just the CPU?

You can just follow the steps to execute llama2 on HTP. But the whole llama2 graph does not run on HTP; any ops or subgraphs that are not supported on HTP fall back to running on the CPU.

czy2014hust avatar May 15 '24 02:05 czy2014hust
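To make the fallback behavior above concrete: after partitioning, backend-supported subgraphs are replaced by delegate call nodes, and whatever remains runs on the portable CPU ops. A small hedged helper (assuming the `EdgeProgramManager` API from the sketch above; a Qualcomm/QNN partitioner would be used in place of the XNNPACK one) can show how much of the graph was actually delegated:

```python
# Hedged sketch: count how many nodes in a lowered edge program were
# delegated to a backend versus left on CPU. Assumes an
# EdgeProgramManager as produced by to_edge(...).to_backend(...)
# in the earlier sketch; API details may differ across versions.
def count_delegated_nodes(edge_program_manager) -> None:
    gm = edge_program_manager.exported_program().graph_module
    delegated = 0
    total = 0
    for node in gm.graph.nodes:
        if node.op != "call_function":
            continue
        total += 1
        # Delegated subgraphs show up as executorch_call_delegate calls;
        # everything else will run on the portable CPU kernels.
        if "executorch_call_delegate" in str(node.target):
            delegated += 1
    print(f"{delegated} delegate call(s) out of {total} call_function nodes")

# Usage, continuing the earlier XNNPACK sketch:
# count_delegated_nodes(edge)
```

If the delegate count is low relative to the total, most of the model is still executing on CPU, which matches the partial-delegation behavior described above.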

> @CHNtentes Are you looking to run the llama2 model via HTP or GPU?

It would be better if I could run Llama on HTP. I suppose the performance and power consumption are better than on CPU/GPU.

CHNtentes avatar May 15 '24 07:05 CHNtentes

In case it's of interest, we provide an example of deploying TinyLlaMA-1.1B-Chat on HTP (SM8650): https://github.com/saic-fi/MobileQuant/tree/main/capp. However, the solution is pretty ad hoc compared to executorch.

fwtan avatar Aug 28 '24 00:08 fwtan