Update building for Android
This PR includes:
- Updates to the docs for building Android on Termux
- Updates to the docs for cross-compiling for Android
- Changes to CMake configuration specific to Android
All changes have been tested (at least on aarch64 arm64-v8a) on both:
- Termux on Android
- `adb shell` on Android
Caveat: If `-c` is not provided, the default context can end up over-initializing memory and killing the app (Termux) or crashing the system (`adb shell`). Since this would require a potentially lower-level fix with a wider scope, I have split the issue out into #9671.
Thanks.
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [x] Low
- [ ] Medium
- [ ] High
Since we don't get many reports for `llama.cpp` on Android, I'll use the opportunity to ask if you (or anyone else) have tried to run the Vulkan backend on Android devices? Wondering if the Vulkan backend is already capable of utilizing the mobile GPU or if more work is needed there. Any feedback in that regard is appreciated.
This PR probably resolves https://github.com/ggerganov/llama.cpp/issues/8705
AFAIU from the comment of @jsamol, `libllama.so` can be compiled with Vulkan support. We can then use it inside an Android app via a JNI binding.
Unfortunately, I cannot speak to Vulkan on Android. It could be a matter of proper configuration to get it working, but I will refrain from more speculation. If I manage to come up with answers, I will report.
> Unfortunately, I cannot speak to Vulkan on Android. It could be a matter of proper configuration to get it working, but I will refrain from more speculation. If I manage to come up with answers, I will report.
I successfully compiled llama.android with Vulkan support on Android, but performance was much worse than running on CPU. If I loaded more than 2 layers onto GPU, it would OOM.
One of the CI workflows is still failing: CI / windows-latest-cmake-sycl (pull_request)
> Unfortunately, I cannot speak to Vulkan on Android. It could be a matter of proper configuration to get it working, but I will refrain from more speculation. If I manage to come up with answers, I will report.

> I successfully compiled llama.android with Vulkan support on Android, but performance was much worse than running on CPU. If I loaded more than 2 layers onto GPU, it would OOM.
Notably, Q4_0_4_4 provided significantly better performance than any GPU build I've tested on Snapdragon devices running Android, even the prompt processing was better than using CLBlast (that already showed some gain over CPU).
> One of the CI workflows is still failing:
> CI / windows-latest-cmake-sycl (pull_request)
Yes, I saw that. I've been unsure why.
Seems the CMake change is making SYCL attempt to link m.dll again.
Investigating...
I think I figured it out. Fixing.
I think it's safe to bump the Android API level to 31 at this point: https://apilevels.com/
The following build works with no additional changes with the Android NDK r26b (previous LTS) and r27b releases.
```
cmake -D CMAKE_TOOLCHAIN_FILE="$NDK/build/cmake/android.toolchain.cmake" -D ANDROID_ABI="arm64-v8a" -D ANDROID_PLATFORM="android-31" -D CMAKE_C_FLAGS="-march=armv8.7a" -D CMAKE_CXX_FLAGS="-march=armv8.7a" -G Ninja -B build-android-arm64
...
cmake --build build-android-arm64
```
Those CFLAGS should be good for all Android ARM64 devices from 2023/24 and enable Q4_0_4_8 support, which is the most performant on current-gen CPUs.
NDK r26 and newer definitely includes OpenMP:
```
cmake-command-above
...
-- Found OpenMP_C: -fopenmp=libomp
-- Found OpenMP_CXX: -fopenmp=libomp
-- Found OpenMP: TRUE
-- OpenMP found
...
```
However, our threadpool implementation is more efficient at this point, so it makes sense to include `GGML_OPENMP=OFF`.
In other words, I don't think the `CMakeLists.txt` changes are needed; we should just update the README to recommend NDK r27b and API Level 31.
Hi, @max-krasnyansky --
I think whichever direction is chosen depends on what kind of (best-effort) support llama.cpp is intended to have for Android, either towards broader device support (with some kind of cut-off) or towards the most powerful and latest. I don't have a strong opinion about that.
As far as the CMakeLists.txt changes specifically, those have to do with linking subtleties re: Bionic; see https://developer.android.com/ndk/guides/stable_apis#c_library.
I'm happy to adjust the README to reflect any recommendations required to steer users of the project.
> I think it's safe to bump the Android API level to 31 at this point: https://apilevels.com/
66.5% coverage seems low for the kind of hardware that we usually support, e.g. we have builds for x86 processors without AVX, which was introduced in 2011. Older phones are perfectly capable of running small LLMs.
> I think it's safe to bump the Android API level to 31 at this point: https://apilevels.com/

> 66.5% coverage seems low for the kind of hardware that we usually support, eg. we have builds for x86 for processors without AVX, which was introduced in 2011. Older phones are perfectly capable of running small LLMs.
That data is a couple of years old, but fair point. We could go with API Level 28, which is sufficient to expose all the APIs we're using.
This builds/works just as well (tested with NDK r26b and r27b, on Galaxy S24).
```
cmake -D CMAKE_TOOLCHAIN_FILE="$NDK/build/cmake/android.toolchain.cmake" -D ANDROID_ABI="arm64-v8a" -D ANDROID_PLATFORM="android-28" -D CMAKE_C_FLAGS="-march=armv8.7a" -D CMAKE_CXX_FLAGS="-march=armv8.7a" -G Ninja -D GGML_OPENMP=OFF -B build-android-arm64
```
> Hi, @max-krasnyansky --
> I think whichever direction is chosen depends on what kind of (best-effort) support `llama.cpp` is intended to have for Android, either towards broader device support (with some kind of cut-off) or towards the most powerful and latest. I don't have a strong opinion about that.
We recently merged a PR for runtime detection of CPU capabilities, so it makes sense to enable all the latest CPU features at build time and let the CPU backend check what's available at runtime.
> As far as the `CMakeLists.txt` changes specifically, those have to do with linking subtleties re: Bionic; see https://developer.android.com/ndk/guides/stable_apis#c_library.
The CMake command I provided already links in everything we need. Here it is again:
```
cmake -D CMAKE_TOOLCHAIN_FILE="$NDK/build/cmake/android.toolchain.cmake" -D ANDROID_ABI="arm64-v8a" -D ANDROID_PLATFORM="android-28" -D CMAKE_C_FLAGS="-march=armv8.7a" -D CMAKE_CXX_FLAGS="-march=armv8.7a" -G Ninja -D GGML_OPENMP=OFF -B build-android-arm64
```
If you run a verbose build, you'll see that it's linking libm explicitly:
```
cmake --build build-android-arm64 --verbose
...
/home/maxk/src/android-ndk-r27b/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++ --target=aarch64-none-linux-android28 --sysroot=/home/maxk/src/android-ndk-r27b/toolchains/llvm/prebuilt/linux-x86_64/sysroot -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -march=armv8.7a -O3 -DNDEBUG -static-libstdc++ -Wl,--build-id=sha1 -Wl,--no-rosegment -Wl,--no-undefined-version -Wl,--fatal-warnings -Wl,--no-undefined -Qunused-arguments -Wl,--gc-sections examples/server/CMakeFiles/llama-server.dir/server.cpp.o -o bin/llama-server common/libcommon.a -pthread src/libllama.so ggml/src/libggml.so -pthread -latomic -lm
```
libdl is not being linked explicitly, but the linker is happy (i.e. no link errors), so that symbol must be getting resolved, and the resulting binaries run on the device without errors.
Ah, OK. `GGML_STATIC=ON` is broken on Android without linking libdl (missing `dladdr`).
So the change for libm is not needed (it's already linked properly), and the libdl change looks good.
I am incorporating and considering the discussion thus far and will make edits.
> We recently merged PR for runtime detection of the CPU capabilities.
I saw that. If indeed the CPU features are detected at runtime, then I can see why we would include all the features during build (though I don't necessarily understand all the intricacies of -march).
> `-lm`
I appreciate seeing that output. Re-reading the Bionic and Android NDK documentation, I have the understanding that libm "is automatically linked by the build systems," in which case, explicitly linking m privately the way llama.cpp does right now is at least redundant. If we are truly averse to that logic in the CMake file, I will change it.
I'd say let's remove the libm change. No need for redundancy.
I'd also update the CMake command with:
- `ANDROID_PLATFORM=android-28`
- `CMAKE_C_FLAGS="-march=armv8.7a"`
- `CMAKE_CXX_FLAGS="-march=armv8.7a"`
(btw ideally we should just add cmake preset for this, not a blocker just a thought)
The rest looks good to me.
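For reference, a preset along those lines might look something like this sketch (assuming an `NDK` environment variable pointing at the NDK root; the preset name is illustrative and not part of this PR):

```json
{
  "version": 4,
  "configurePresets": [
    {
      "name": "android-arm64",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build-android-arm64",
      "toolchainFile": "$env{NDK}/build/cmake/android.toolchain.cmake",
      "cacheVariables": {
        "ANDROID_ABI": "arm64-v8a",
        "ANDROID_PLATFORM": "android-28",
        "CMAKE_C_FLAGS": "-march=armv8.7a",
        "CMAKE_CXX_FLAGS": "-march=armv8.7a",
        "GGML_OPENMP": "OFF"
      }
    }
  ]
}
```

With a `CMakePresets.json` like this, the long command line collapses to `cmake --preset android-arm64` followed by `cmake --build build-android-arm64`.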
I will remove the libm change; though, to be clear, master currently links the lib explicitly, which is what I meant.
> btw ideally we should just add cmake preset for this
I agree. There is already CMake logic for Android -march, though when I started this PR, I didn't have a good idea about the direction to take with it. I will leave that out of this PR.
I have completed changes reflecting this discussion and have tested the builds myself using Android NDK r25c, r26b, and r27b.
One thing I will note is that, some time in the last week, the tokens/s performance (using llama-simple) in `adb shell` has dropped by about half. Very striking, and no change related to this PR seems to have made a difference either way (NDK, API level, `-march`).
> One thing I will note is that, some time in the last week, the tokens/s performance (using llama-simple) in adb shell has dropped by about half.
Can you pinpoint the commit that introduced the regression? What is the exact llama-simple command that you use?
I have checked out many commits (all before mine) and have yet to pinpoint it. It could be something else.
The exact command I have been using is `LD_LIBRARY_PATH=android/lib ./android/bin/llama-simple -m Q2_K-Meta-Llama-3.1-8B-Instruct.gguf -c 4096`.
FYI, I tried this on Termux (both w/ and w/o my commits) and I did not observe the same regression. Something weird is afoot. If I discover something, I will report.
Thanks. I think we are good to merge this. Agree?
As a heads-up, `armv8.7a` will not work with older devices, e.g. the Pixel 6 Pro (a 3-year-old device from 2021), even though these devices are running recent Android versions (Android 14).
> as a heads-up armv8.7a will not work with older devices i.e. Pixel 6 pro devices (3 year's old device, 2021), even though these devices are running recent Android versions (Android 14)
It should, because of the runtime detection of CPU capabilities (such as MATMUL_INT8, etc.). In theory, it's possible that the compiler will use one of those instructions elsewhere, but it's quite unlikely.
Can you try the latest on Pixel 6 Pro? Let us know if it doesn't work and we'll iterate further if needed.
Thanks, all. @dcale, thanks for your report. Please follow up with more info so we can continue to clarify any issues.
> as a heads-up armv8.7a will not work with older devices i.e. Pixel 6 pro devices (3 year's old device, 2021), even though these devices are running recent Android versions (Android 14)

> It should because of the runtime detection of the CPU capabilities (such as MATMUL_INT8, etc). In theory, it's possible that the compiler will use one of those instructions elsewhere but it's quite unlikely.
> Can you try the latest on Pixel 6 Pro? Let us know if it doesn't work and we'll iterate further if needed.
I'll do that and report back.
@max-krasnyansky @amqdn Running llama compiled with the `-march=armv8.7a` flag results in SIGILL (ILL_ILLOPC) on the Pixel 4a (2020). To be fair, the old target, `armv8.4a+dotprod`, wasn't much better. Only targeting `armv8.2-a`, which I believe is the highest compatible version for that model, makes it work.
@jsamol Thanks for the report. I will defer to the others about what to do here.
> Since we don't get much reports for `llama.cpp` on Android, I'll use the opportunity to ask if you (or anyone else) have tried to run the Vulkan backend on Android devices? Wondering if the Vulkan backend is already capable of utilizing the mobile GPU or if more work is needed there. Any feedback in that regard is appreciated.
I did try it. I built a host tool for shader generation and got the correct Vulkan HPP headers (they're missing in NDK >26 and incomplete in NDK ≤26). Compilation works fine, but performance is 2x worse than CPU-only. It also seems to use more RAM, even though the logs show the same amount with and without Vulkan.
Clearly some optimizations are needed here.