candle
Any tips to speed up quantized Whisper inference on Android?
Hello, running Q8_0 quantized Whisper on Android (Pixel 7) takes around 15 seconds for 5 seconds of audio. Is there any way to speed this up that I might not be aware of, or is it just that candle isn't as optimized as something like whisper.cpp yet? whisper.cpp took around 3 seconds or less if I remember correctly, although that was with a Q4_0 model. Thanks.
I'm not super familiar with how cross-compiling to Android works, or which SIMD instructions/BLAS libraries are available on these platforms. Most likely the compiled binary is not benefiting from these, which would explain the slowness (on normal builds and wasm, we have some specific build setup and code to hopefully use SIMD instructions).
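One way to sanity-check whether a given build actually sees SIMD at runtime is Rust's standard feature-detection macros. This is a standalone sketch, not candle code; candle's actual dispatch may check different features.

```rust
// Print which SIMD features the running binary detects at runtime.
// Uses only std; the feature names shown are the ones relevant to
// the fp16/NEON discussion in this thread.
fn main() {
    #[cfg(target_arch = "aarch64")]
    {
        println!("neon: {}", std::arch::is_aarch64_feature_detected!("neon"));
        println!("fp16: {}", std::arch::is_aarch64_feature_detected!("fp16"));
    }
    #[cfg(target_arch = "x86_64")]
    {
        println!("avx2: {}", is_x86_feature_detected!("avx2"));
        println!("f16c: {}", is_x86_feature_detected!("f16c"));
    }
}
```

Running this on the device (or in an Android shell via adb) would show whether NEON/fp16 are being picked up at all.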
I just added a `CANDLE_DEQUANTIZE_ALL` environment variable that forces the standard matmul rather than the quantized one. Could you try running your tests with it set to 1, just in case?
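The variable name is as described above; the parsing below is just an illustrative sketch of how such a boolean env flag is typically read, not candle's actual code.

```rust
use std::env;

/// Pure helper so the flag parsing is easy to test:
/// only the exact value "1" enables the flag.
fn flag_enabled(value: Option<&str>) -> bool {
    matches!(value, Some("1"))
}

/// Returns true when CANDLE_DEQUANTIZE_ALL is set to "1", i.e. when
/// quantized tensors should be dequantized and the regular matmul used.
fn dequantize_all() -> bool {
    flag_enabled(env::var("CANDLE_DEQUANTIZE_ALL").ok().as_deref())
}

fn main() {
    println!("dequantize all: {}", dequantize_all());
}
```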
Wow, enabling that resulted in it taking only around 4.5 seconds!
Interesting, thanks for reporting this back. There are multiple things at play here and I'll have to dig a bit deeper to understand what is going on.
- It could be that the quantized matmul doesn't detect the SIMD instructions but the normal matmul does (unlikely).
- Using Q8_0 is slower than Q4_0; it's supposed to be optimized but maybe we've missed something.
- The quantized matmul isn't as smart as the unquantized one when it comes to cache locality. In GPT-like architectures it's usually not an issue, but the Whisper encoding step might not have the same properties.
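For reference on the second point, Q8_0 and Q4_0 differ mainly in bytes per 32-weight block. The sizes below are the standard GGML block definitions; the snippet is just the arithmetic, not candle code.

```rust
// Bytes per block and resulting bits per weight for the GGML quant
// formats discussed above. Both formats quantize 32 weights per block
// with a shared f16 scale (2 bytes).
const QK: usize = 32; // weights per block in both Q4_0 and Q8_0

fn bits_per_weight(block_bytes: usize) -> f64 {
    (block_bytes * 8) as f64 / QK as f64
}

fn main() {
    let q8_0 = 2 + QK;     // f16 scale + one i8 per weight   = 34 bytes
    let q4_0 = 2 + QK / 2; // f16 scale + one nibble per weight = 18 bytes
    println!("Q8_0: {} bits/weight", bits_per_weight(q8_0)); // 8.5
    println!("Q4_0: {} bits/weight", bits_per_weight(q4_0)); // 4.5
}
```

So Q4_0 moves roughly half the bytes of Q8_0 per matmul, which is why it being *slower* here points at something other than memory bandwidth.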
Btw, I was a bit off on the timing: I wasn't actually testing with 5 seconds of audio but more like 2 seconds of speech followed by 3 seconds of silence. Also, the way I was recording it was broken, but after fixing that the difference is still about the same. Just mentioning it in case you try it and get slower than 4.5 seconds.
Q4_0 is even slower (without CANDLE_DEQUANTIZE_ALL set to 1), so it can't be that. Is there anything more I can do to help figure out the cause of the problem? I could provide the source code of my app if needed, since I'm going to open source it anyway.
@soupslurpr can you share how you got it running?
When I build with cargo it always fails with errors such as:
```
error: instruction requires: fullfp16
error: could not compile `gemm-f16` (lib) due to 11 previous errors
```
@rbrus are you sure you are using the Android NDK to compile?
For example, in `.cargo/config.toml` I specified:
```toml
[target.aarch64-linux-android]
ar = "C:/Users/user/AppData/Local/Android/Sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/windows-x86_64/bin/llvm-ar.exe"
linker = "C:/Users/user/AppData/Local/Android/Sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/windows-x86_64/bin/aarch64-linux-android21-clang.cmd"
```
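For context, the config above only covers the linker; the Android target also has to be installed for the toolchain. A typical invocation (standard rustup/cargo commands, assuming a config like the one above is in place) looks like:

```shell
# One-time: add the Android target to the Rust toolchain.
rustup target add aarch64-linux-android

# Build using the NDK linker configured in .cargo/config.toml.
cargo build --release --target aarch64-linux-android
```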
Thanks @soupslurpr. This seems to be set up correctly, but it still fails to build. I am wondering if there is some issue with the fp16 build for Android?
@rbrus what version of candle?
Also, are you building for the aarch64-linux-android target?
And note that the paths in the config I provided will need to be changed for your own Windows username.
Yes, and it always fails with the same error:
```
error: instruction requires: fullfp16
  --> /home/sus/.cargo/registry/src/index.crates.io-6f17d22bba15001f/gemm-common-0.17.0/src/simd.rs:1940:18
     |
1940 | "fmul {0:v}.8h, {1:v}.8h, {2:v}.8h",
     | ^
     |
note: instantiated into assembly here
  -->
error: could not compile `gemm-f16` (lib) due to 11 previous errors
warning: build failed, waiting for other jobs to finish...
```
I run it on Ubuntu 22.04, recently upgraded to 23.10. The versions of Rust, Cargo, and candle are all the most recent; I set everything up today.
So you changed the config.toml to point to the NDK you downloaded?
Yes, exactly, to NDK 25.2.9519653.
You haven't had such an issue?
No, maybe try NDK 26.1.10909125?
Also, perhaps try adding this under `[target.aarch64-linux-android]`:
```toml
rustflags = ["-C", "target-feature=+fp16,+neon"]
```
I think I needed this before and had the same error as you, but it isn't needed for me anymore.
@soupslurpr after changing the NDK it built! Huh, thanks for the help!
By any chance, do you have a project that runs inference with the rlib?
@rbrus great!
I do have a project I'm working on for running whisper speech to text on Android using Candle, but I'm not working on it currently as the speed is still too slow.
@LaurentMazare idk if this helps but have you seen https://developer.android.com/ndk/guides/cpu-arm-neon
Does candle use neon?
Edit: looks like it does. Maybe https://developer.android.com/ndk/guides/neuralnetworks could be implemented, as it seems to be intended for machine learning libraries and accelerates them?
@soupslurpr how did you build candle targeting Android? I am getting an OpenSSL error. I am on Mac and have set up the environment variables like below:
```
export AR="/Users/akashsingh/.NDK/arm64/bin/llvm-ar"
export CC="/Users/akashsingh/.NDK/arm64/bin/aarch64-linux-android-clang"
```
@singhaki I don't compile OpenSSL because it's a pain. I think there is a feature to disable it (or it might actually be in the hf_hub crate) that disables networking.
Can you provide the steps to compile for Android? I am trying to run phi-2 on an old Redmi Note 7 Pro Android device. We can put it in this discussion if anyone else is interested: https://github.com/huggingface/candle/discussions/2081
Just tested, and this issue actually happens on x86_64 too. Tested on Windows: the quantized Whisper is way slower (5x when measured with hyperfine, 10 seconds vs 2 seconds) than the unquantized one / the one with CANDLE_DEQUANTIZE_ALL set to 1, so it isn't Android-specific.
@singhaki you need to download the OpenSSL source and set OPENSSL_DIR and OPENSSL_LIB_DIR to it to compile. Or at least, that's how I've done it.