candle
Any tips to speed up quantized Whisper inference on Android?
Hello, running Q8_0 quantized Whisper on Android (Pixel 7) takes around 15 seconds for 5 seconds of audio. Is there any way to speed this up that I might not be aware of, or is it just that candle isn't as optimized as something like whisper.cpp yet? whisper.cpp took around 3 seconds or less if I remember correctly, although that was with a Q4_0 model. Thanks.
I'm not super familiar with how cross-compiling to Android works, or which SIMD instructions/BLAS libraries are available on these platforms. Most likely the compiled binary is not benefiting from these, which would explain the slowness (on normal builds and wasm, we have some specific build setup and code to hopefully use SIMD instructions).
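One way to sanity-check whether a given build actually sees SIMD at runtime is Rust's standard feature-detection macros. This is a standalone sketch, not candle code; candle's actual dispatch may check different features.

```rust
// Print which SIMD features the running binary detects at runtime.
// Uses only std; the feature names shown are the ones relevant to
// the fp16/NEON discussion in this thread.
fn main() {
    #[cfg(target_arch = "aarch64")]
    {
        println!("neon: {}", std::arch::is_aarch64_feature_detected!("neon"));
        println!("fp16: {}", std::arch::is_aarch64_feature_detected!("fp16"));
    }
    #[cfg(target_arch = "x86_64")]
    {
        println!("avx2: {}", is_x86_feature_detected!("avx2"));
        println!("f16c: {}", is_x86_feature_detected!("f16c"));
    }
}
```

Running this on the device (or in an Android shell via adb) would show whether NEON/fp16 are being picked up at all.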
I just added a `CANDLE_DEQUANTIZE_ALL` environment variable that forces the standard matmul rather than the quantized one. Could you try running your tests with it set to 1, just in case?
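The variable name is as described above; the parsing below is just an illustrative sketch of how such a boolean env flag is typically read, not candle's actual code.

```rust
use std::env;

/// Pure helper so the flag parsing is easy to test:
/// only the exact value "1" enables the flag.
fn flag_enabled(value: Option<&str>) -> bool {
    matches!(value, Some("1"))
}

/// Returns true when CANDLE_DEQUANTIZE_ALL is set to "1", i.e. when
/// quantized tensors should be dequantized and the regular matmul used.
fn dequantize_all() -> bool {
    flag_enabled(env::var("CANDLE_DEQUANTIZE_ALL").ok().as_deref())
}

fn main() {
    println!("dequantize all: {}", dequantize_all());
}
```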
Wow, enabling that resulted in it taking only around 4.5 seconds!
Interesting, thanks for reporting this back. There are multiple things at play here and I'll have to dig a bit deeper to understand what is going on.
- It could be that the quantized matmul doesn't detect the SIMD instructions but the normal matmul does (unlikely).
- Using Q8_0 is slower than Q4_0; it's supposed to be optimized but maybe we've missed something.
- The quantized matmul isn't as smart as the unquantized one when it comes to cache locality. In GPT-like architectures it's usually not an issue, but the Whisper encoding step might not have the same properties.
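For reference on the second point, Q8_0 and Q4_0 differ mainly in bytes per 32-weight block. The sizes below are the standard GGML block definitions; the snippet is just the arithmetic, not candle code.

```rust
// Bytes per block and resulting bits per weight for the GGML quant
// formats discussed above. Both formats quantize 32 weights per block
// with a shared f16 scale (2 bytes).
const QK: usize = 32; // weights per block in both Q4_0 and Q8_0

fn bits_per_weight(block_bytes: usize) -> f64 {
    (block_bytes * 8) as f64 / QK as f64
}

fn main() {
    let q8_0 = 2 + QK;     // f16 scale + one i8 per weight   = 34 bytes
    let q4_0 = 2 + QK / 2; // f16 scale + one nibble per weight = 18 bytes
    println!("Q8_0: {} bits/weight", bits_per_weight(q8_0)); // 8.5
    println!("Q4_0: {} bits/weight", bits_per_weight(q4_0)); // 4.5
}
```

So Q4_0 moves roughly half the bytes of Q8_0 per matmul, which is why it being *slower* here points at something other than memory bandwidth.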
Btw, I was a bit off on the timing: I wasn't actually testing with 5 seconds of audio but more like 2 seconds of speech followed by 3 seconds of silence. Also, the way I was recording it was broken, but after fixing that the difference is still about the same. Just mentioning it in case you try it and get slower than 4.5 seconds.
Q4_0 is even slower (without CANDLE_DEQUANTIZE_ALL set to 1), so it can't be that. Is there anything more I can do to help figure out the cause of the problem? I could provide the source code of my app if needed, since I'm going to open source it anyway.
@soupslurpr can you share how you got it running?
When I build with cargo it always fails with errors such as:
```
error: instruction requires: fullfp16
error: could not compile `gemm-f16` (lib) due to 11 previous errors
```
@rbrus are you sure you are using the Android NDK to compile?
For example, in `.cargo/config.toml` I specified:
```toml
[target.aarch64-linux-android]
ar = "C:/Users/user/AppData/Local/Android/Sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/windows-x86_64/bin/llvm-ar.exe"
linker = "C:/Users/user/AppData/Local/Android/Sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/windows-x86_64/bin/aarch64-linux-android21-clang.cmd"
```
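For context, the config above only covers the linker; the Android target also has to be installed for the toolchain. A typical invocation (standard rustup/cargo commands, assuming a config like the one above is in place) looks like:

```shell
# One-time: add the Android target to the Rust toolchain.
rustup target add aarch64-linux-android

# Build using the NDK linker configured in .cargo/config.toml.
cargo build --release --target aarch64-linux-android
```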
Thanks @soupslurpr. This seems to be set up correctly, but it still fails to build. I am wondering if there is some issue with the fp16 build for Android?
@rbrus what version of candle?
Also, are you building for the aarch64-linux-android target?
And note that the paths in the config I provided will need to be changed for your own Windows username.
Yes, and it always fails with the same error:
```
error: instruction requires: fullfp16
  --> /home/sus/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gemm-common-0.17.0/src/simd.rs:1940:18
     |
1940 | "fmul {0:v}.8h, {1:v}.8h, {2:v}.8h",
     | ^
     |
note: instantiated into assembly here
  -->
error: could not compile `gemm-f16` (lib) due to 11 previous errors
warning: build failed, waiting for other jobs to finish...
```
I run it on Ubuntu 22.04, recently upgraded to 23.10. The versions of Rust, Cargo, and candle are all the most recent; I set everything up today.
So you changed the config.toml to point to the NDK you downloaded?
Yes, exactly, to NDK 25.2.9519653.
You haven't had such an issue?
No, maybe try NDK 26.1.10909125?
Also, perhaps try adding this under `[target.aarch64-linux-android]`:
```toml
rustflags = ["-C", "target-feature=+fp16,+neon"]
```
I think I needed this before and had the same error as you, but it isn't needed for me anymore.
@soupslurpr after changing the NDK it built! Huh, thanks for the help!
By any chance, do you have a project that runs inference with the rlib?
@rbrus great!
I do have a project I'm working on for running whisper speech to text on Android using Candle, but I'm not working on it currently as the speed is still too slow.
@LaurentMazare idk if this helps but have you seen https://developer.android.com/ndk/guides/cpu-arm-neon
Does candle use neon?
Edit: looks like it does. Maybe https://developer.android.com/ndk/guides/neuralnetworks could be implemented, as it seems to be intended for machine learning libraries and accelerates them?
@soupslurpr how did you build candle targeting Android? I am getting an OpenSSL error. I am on Mac and have set up the environment variables like below:
```
export AR="/Users/akashsingh/.NDK/arm64/bin/llvm-ar"
export CC="/Users/akashsingh/.NDK/arm64/bin/aarch64-linux-android-clang"
```
@singhaki I don't compile OpenSSL because it's a pain. I think there is a feature to disable it (or it might actually be in the hf_hub crate) that disables networking.
Can you provide the steps to compile for Android? I am trying to run phi-2 on an old Redmi Note 7 Pro Android device. We can put it in this discussion if anyone else is interested: https://github.com/huggingface/candle/discussions/2081
Just tested, and this issue actually happens on x86_64 too. Tested on Windows: the quantized Whisper is way slower (5x when measured with hyperfine, 10 seconds vs 2 seconds) than the unquantized one / the one with CANDLE_DEQUANTIZE_ALL set to 1, so it isn't Android-specific.
@singhaki you need to download the OpenSSL source and set OPENSSL_DIR and OPENSSL_LIB_DIR to it to compile. Or at least, that's how I've done it.