
Will this project support on-device NPUs like Qualcomm Hexagon?

Open AndreaChiChengdu opened this issue 1 year ago • 22 comments

I am very interested in mobile-side deployment and would like to know whether there is an opportunity to use the mobile NPU/GPU in Android devices for acceleration. Thanks!

AndreaChiChengdu avatar Aug 21 '23 06:08 AndreaChiChengdu

Running computation with the Android NN API requires a dedicated compute backend for it. It's possible to devote some effort to developing such a backend if enough interest in adoption can be found in the community for different usage scenarios. Please note, however, that this can take some time.

monatis avatar Aug 21 '23 10:08 monatis

Running computation with the Android NN API requires a dedicated compute backend for it. It's possible to devote some effort to developing such a backend if enough interest in adoption can be found in the community for different usage scenarios. Please note, however, that this can take some time.

Qualcomm announced Llama 2 is coming to Snapdragon in 2024, and I strongly suspect LLMs will become an integral part of the smartphone experience soon. Personally, I would be super excited to run such models on mobile wherever I go, without the need for a cellular connection.

So yes, I think the interest is going to grow bigger and bigger in the upcoming months. Right now it's not really feasible, especially as prompt processing is not yet properly accelerated by the GPU or NPU, so I would like to see that change. There's a lot of potential to tap into with the Hexagon and the other ML accelerators in these modern phones.

Dampfinchen avatar Aug 21 '23 14:08 Dampfinchen

@ggerganov Would it be of interest to introduce Android NN API as a new backend?

To be on the same page:

  • Android NN API is a C library that provides ML op primitives that can be delegated to the NPU, TPU, GPU or CPU on Android.
  • It provides scalar and tensor data types for integers and floats in 32, 16 and 8 bits.
  • You define a DAG with references to the tensors' buffers and then schedule the inference with one of the predefined power/speed tradeoff preferences.

If we decide to do so, a possible plan of attack might be:

  1. Implement inference of a GGUF model with pure Android NN API (a minimal sketch of the API flow is below).
  2. If it's promising compared to running directly on the CPU, start implementing it as a compute backend in GGML.
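
For concreteness, here is a minimal sketch of that flow with the NNAPI C API, building and running a single ADD op. It is illustrative only: error checks are omitted, shapes are hardcoded, and a real backend would build the graph from GGUF tensors instead of toy buffers.

```cpp
#include <android/NeuralNetworks.h>
#include <cstdio>

int main() {
    // 1. Describe the graph: two fp32 input tensors, one fused-activation
    //    scalar, one fp32 output tensor, connected by an ADD operation.
    uint32_t dims[2] = {1, 4};
    ANeuralNetworksOperandType tensor_f32 = {ANEURALNETWORKS_TENSOR_FLOAT32, 2, dims, 0.0f, 0};
    ANeuralNetworksOperandType scalar_i32 = {ANEURALNETWORKS_INT32, 0, nullptr, 0.0f, 0};

    ANeuralNetworksModel * model = nullptr;
    ANeuralNetworksModel_create(&model);
    ANeuralNetworksModel_addOperand(model, &tensor_f32); // 0: input a
    ANeuralNetworksModel_addOperand(model, &tensor_f32); // 1: input b
    ANeuralNetworksModel_addOperand(model, &scalar_i32); // 2: activation
    ANeuralNetworksModel_addOperand(model, &tensor_f32); // 3: output

    int32_t fuse_none = ANEURALNETWORKS_FUSED_NONE;
    ANeuralNetworksModel_setOperandValue(model, 2, &fuse_none, sizeof(fuse_none));

    uint32_t op_in[3]  = {0, 1, 2};
    uint32_t op_out[1] = {3};
    ANeuralNetworksModel_addOperation(model, ANEURALNETWORKS_ADD, 3, op_in, 1, op_out);

    uint32_t model_in[2] = {0, 1};
    ANeuralNetworksModel_identifyInputsAndOutputs(model, 2, model_in, 1, op_out);
    ANeuralNetworksModel_finish(model);

    // 2. Compile with one of the predefined power/speed preferences.
    ANeuralNetworksCompilation * comp = nullptr;
    ANeuralNetworksCompilation_create(model, &comp);
    ANeuralNetworksCompilation_setPreference(comp, ANEURALNETWORKS_PREFER_SUSTAINED_SPEED);
    ANeuralNetworksCompilation_finish(comp);

    // 3. Bind buffers and run; the driver decides whether this lands on
    //    the NPU, GPU or CPU.
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, out[4] = {0};
    ANeuralNetworksExecution * exec = nullptr;
    ANeuralNetworksExecution_create(comp, &exec);
    ANeuralNetworksExecution_setInput(exec, 0, nullptr, a, sizeof(a));
    ANeuralNetworksExecution_setInput(exec, 1, nullptr, b, sizeof(b));
    ANeuralNetworksExecution_setOutput(exec, 0, nullptr, out, sizeof(out));

    ANeuralNetworksEvent * done = nullptr;
    ANeuralNetworksExecution_startCompute(exec, &done);
    ANeuralNetworksEvent_wait(done);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);

    ANeuralNetworksEvent_free(done);
    ANeuralNetworksExecution_free(exec);
    ANeuralNetworksCompilation_free(comp);
    ANeuralNetworksModel_free(model);
    return 0;
}
```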

monatis avatar Aug 21 '23 15:08 monatis

It's definitely of interest. It has to be implemented as a new backend in llama.cpp, similar to CUDA, Metal, OpenCL, etc. The ggml library has to remain backend agnostic.

The best option would be if the Android API allows the implementation of custom kernels, so that we can leverage the quantization formats that we currently have. Otherwise, there might not be a good enough argument for integrating the backend with ggml; one could just implement the neural network directly with the Android building blocks.

ggerganov avatar Aug 21 '23 18:08 ggerganov

It's definitely of interest.

Great. I'll give it a test drive this week.

The best option would be if the Android API allows the implementation of custom kernels,

Its support for custom kernels may be limited, and we may need to dequantize to 8-bit beforehand, but I'll dig more into this. If not, we may still make use of it for 8/16/32-bit tensors but drop down to lower-level libraries for k-quants. Let me give it a try and see what's possible.

monatis avatar Aug 21 '23 20:08 monatis

@monatis did anything interesting come up?

BarfingLemurs avatar Sep 06 '23 10:09 BarfingLemurs

@BarfingLemurs I dug into the NPU specs, but it turned out that the NPU's support for custom ops is limited. It supports a set of pre-defined quantization/dequantization ops and 8-bit/16-bit tensors. Of course there might be workarounds such as dequantizing tensors before offloading, but I doubt that would give a performance boost, since we would become I/O-bound that way. We can run 8-bit GGML models there, but even 8B models are too big in 8-bit precision for most smartphones, I think. A GPNPU seems to be a better option for supporting the custom kernels required for GGML's real benefits, but I'm not aware of devices that ship with a GPNPU yet. Until then, mobile GPUs seem to be our best bet on Android. With that said, I still want to play with it after some higher-priority work to see what's possible.
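
To make the "dequantize before offloading" workaround concrete, a minimal sketch follows. It assumes ggml's Q8_0 block layout (an fp16 scale followed by 32 int8 quants) and uses a simplified half-to-float helper rather than ggml's own conversion; the extra copy it performs is exactly where the I/O-bound concern comes from.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Assumed layout of a ggml Q8_0 block: fp16 scale + 32 int8 quants.
constexpr int QK8_0 = 32;
struct block_q8_0 {
    uint16_t d;          // scale, stored as IEEE fp16
    int8_t   qs[QK8_0];  // quantized values
};

// Simplified fp16 -> fp32 conversion (denormals flushed to zero, no NaN handling).
static float fp16_to_fp32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    if (exp == 0) return 0.0f;
    uint32_t bits = sign | ((exp + 112) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

// Expand n_blocks Q8_0 blocks into a float32 buffer the NPU driver can consume.
// This copy (and the extra memory traffic) is why offloading without custom
// kernels risks being I/O-bound rather than compute-bound.
void dequantize_q8_0(const block_q8_0 * x, float * y, size_t n_blocks) {
    for (size_t i = 0; i < n_blocks; ++i) {
        const float d = fp16_to_fp32(x[i].d);
        for (int j = 0; j < QK8_0; ++j) {
            y[i * QK8_0 + j] = d * x[i].qs[j];
        }
    }
}
```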

monatis avatar Sep 06 '23 11:09 monatis

Hey @monatis, thanks for the shout-out.

The good news is that our programming model is C++, so looking at how other GPU backends are supported here in ggml, it seems straightforward to enable support for the GPNPU in ggml. We support 8W8A quantization in the current release version of our architecture, and we are looking at 4W8A and others in the next version.

We're open to getting GGML contributors access to our C++ SDK and figuring out ways to get support into GGML.

dfiru avatar Sep 29 '23 17:09 dfiru

Hey @dfiru, thanks for reaching out; great to have you here from Quadric!

I believe that the GPNPU's approach is the right one, and I volunteer; I'm definitely willing to explore possibilities and contribute to the implementation.

Should I contact you via PM or something to take this further?

monatis avatar Sep 29 '23 17:09 monatis

I'd like to mention that I'm still very much interested in an Android NN API backend for generic Android GPUs, as CLBlast on Android doesn't show any performance benefit for single-batch or parallel decoding tasks (which would be useful for increasing t/s on a potential Medusa model implementation).

BarfingLemurs avatar Sep 29 '23 18:09 BarfingLemurs

@monatis no DMs on GH :/

DM me on Twitter (linked from my GH profile) and we can figure something out.

dfiru avatar Sep 30 '23 15:09 dfiru

@monatis You mentioned Android GPU - what are the options to program for mobile GPUs? Vulkan?

ggerganov avatar Oct 02 '23 10:10 ggerganov

Yes, Vulkan is the recommended approach: https://developer.android.com/ndk/guides/graphics/getting-started

Apparently, the Nomic team implemented a Vulkan backend for gpt4all, but I'm not sure about the compatibility of their custom license.
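
For anyone who wants to check what their phone exposes, a minimal probe for a compute-capable Vulkan device could look like the sketch below (it assumes the NDK Vulkan headers and loader are available; error handling is mostly omitted).

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

int main() {
    VkApplicationInfo app{};
    app.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.pApplicationName = "vulkan-compute-probe";
    app.apiVersion = VK_API_VERSION_1_1;

    VkInstanceCreateInfo ici{};
    ici.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ici.pApplicationInfo = &app;

    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&ici, nullptr, &instance) != VK_SUCCESS) {
        printf("no Vulkan loader / driver available\n");
        return 1;
    }

    // Enumerate GPUs and report which queue families support compute,
    // which is what a compute-shader backend would need.
    uint32_t n_dev = 0;
    vkEnumeratePhysicalDevices(instance, &n_dev, nullptr);
    std::vector<VkPhysicalDevice> devices(n_dev);
    vkEnumeratePhysicalDevices(instance, &n_dev, devices.data());

    for (VkPhysicalDevice dev : devices) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(dev, &props);

        uint32_t n_q = 0;
        vkGetPhysicalDeviceQueueFamilyProperties(dev, &n_q, nullptr);
        std::vector<VkQueueFamilyProperties> queues(n_q);
        vkGetPhysicalDeviceQueueFamilyProperties(dev, &n_q, queues.data());

        for (uint32_t i = 0; i < n_q; ++i) {
            if (queues[i].queueFlags & VK_QUEUE_COMPUTE_BIT) {
                printf("%s: compute queue family %u\n", props.deviceName, i);
            }
        }
    }

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```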

monatis avatar Oct 02 '23 11:10 monatis

Would anything change in the implementation to make use of the TPU in Pixel devices?

nivibilla avatar Oct 05 '23 15:10 nivibilla

Would using https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk/getting-started suffice for devices with the 8cx Gen 3 and X Elite next year? I'm more interested in Windows on ARM support than Android.

xgdgsc avatar Nov 12 '23 01:11 xgdgsc

Coming in here late, and without a lot of experience in either yet, but a lot (most?) of ARM CPUs now have NPUs, as do RISC-V ones. It's not just phones anymore but embedded devices and mid-range desktops, and their power is increasing with each release. There are also the Google Coral-style TPUs, which are accessible to us mere mortals.

ghost avatar Nov 23 '23 19:11 ghost

For the SD8 Gen 3, the claim is inference of a 7B Llama 2 "based" model at 20 tokens/sec. As there will be native support for INT4, I would assume 4-bit quantization. I assume this is the NPU: https://docs.qualcomm.com/bundle/publicresource/87-71408-1_REV_B_Snapdragon_8_gen_3_Mobile_Platform_Product_Brief.pdf

dimopep avatar Nov 30 '23 18:11 dimopep

There is also support for the Intel NPU and AMD XDNA 2 coming in new processors. From 2024, all consumer PCs will have a powerful NPU capable of 50 TOPS, as dictated by Windows 12, and Windows will offload many AI tasks to this NPU.

agonzalezm avatar Dec 11 '23 14:12 agonzalezm

The SD8 performance metrics demystified:

https://www.qualcomm.com/news/onq/2023/11/accelerating-generative-ai-at-the-edge

"We reduced the memory bandwidth through knowledge distillation, quantization-aware training, and speculative decoding...We use quantization-aware training with knowledge distillation to address these challenges and achieve an accurate and smaller INT4 model"

dimopep avatar Dec 11 '23 19:12 dimopep

Are there any new updates to this discussion currently?

shifeiwen avatar Mar 05 '24 03:03 shifeiwen

Google recently published a guide and a blog about the new experimental MediaPipe LLM Inference API:

They also have a code example: mediapipe/examples/llm_inference

EwoutH avatar Apr 19 '24 11:04 EwoutH

Google recently published a guide and a blog about the new experimental MediaPipe LLM Inference API:

They also have a code example: mediapipe/examples/llm_inference

It seems like this is restricted to some handpicked models for some reason. I wonder if it is possible to expand this selection without Google's help.

scarlettekk avatar Apr 21 '24 18:04 scarlettekk

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jun 06 '24 01:06 github-actions[bot]

So, it's hard to use Qualcomm Hexagon? Am I right?

Yemaoxin avatar Jul 18 '24 12:07 Yemaoxin

Can we reopen this issue? With Hexagon NPUs finding their way to laptops and desktops, it’s only going to be more relevant.

EwoutH avatar Jul 20 '24 08:07 EwoutH

Yes, I think this is a vital feature.

Yemaoxin avatar Jul 26 '24 03:07 Yemaoxin

Of note: NNAPI will be deprecated in Android 15 https://developer.android.com/ndk/guides/neuralnetworks/migration-guide

scarlettekk avatar Aug 09 '24 22:08 scarlettekk

In order to accelerate llama.cpp on Qualcomm hardware, do we need to implement the ggml parts using the 'Qualcomm Neural Processing SDK' API or the 'Hexagon SDK' API?

sparkleholic avatar Sep 04 '24 09:09 sparkleholic