Update building for Android
This PR includes:
- Updates to the docs for building Android on Termux
- Updates to the docs for cross-compiling for Android
- Changes to CMake configuration specific to Android
All changes have been tested (at least on aarch64 arm64-v8a) on both:
- Termux on Android
- `adb shell` on Android
Caveat: If `-c` is not provided, the default context can end up over-initializing memory and killing the app (Termux) or crashing the system (`adb shell`). Since this would require a potentially lower-level fix with a wider scope, I have split the issue out into #9671.
Thanks.
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [x] Low
- [ ] Medium
- [ ] High
Since we don't get many reports for `llama.cpp` on Android, I'll use the opportunity to ask if you (or anyone else) have tried to run the Vulkan backend on Android devices? Wondering if the Vulkan backend is already capable of utilizing the mobile GPU or if more work is needed there. Any feedback in that regard is appreciated.
This PR probably resolves https://github.com/ggerganov/llama.cpp/issues/8705
AFAIU from the comment of @jsamol, `libllama.so` can be compiled with Vulkan support. We can then use it inside an Android app via a JNI binding.
Unfortunately, I cannot speak to Vulkan on Android. It could be a matter of proper configuration to get it working, but I will refrain from more speculation. If I manage to come up with answers, I will report.
> Unfortunately, I cannot speak to Vulkan on Android. It could be a matter of proper configuration to get it working, but I will refrain from more speculation. If I manage to come up with answers, I will report.
I successfully compiled llama.android with Vulkan support on Android, but performance was much worse than running on CPU. If I loaded more than 2 layers onto GPU, it would OOM.
One of the CI workflows is still failing: CI / windows-latest-cmake-sycl (pull_request)
> Unfortunately, I cannot speak to Vulkan on Android. It could be a matter of proper configuration to get it working, but I will refrain from more speculation. If I manage to come up with answers, I will report.

> I successfully compiled llama.android with Vulkan support on Android, but performance was much worse than running on CPU. If I loaded more than 2 layers onto GPU, it would OOM.
Notably, Q4_0_4_4 provided significantly better performance than any GPU build I've tested on Snapdragon devices running Android, even the prompt processing was better than using CLBlast (that already showed some gain over CPU).
> One of the CI workflows is still failing:
> CI / windows-latest-cmake-sycl (pull_request)
Yes, I saw that. I've been unsure why.
Seems the CMake change is making SYCL attempt to link m.dll again.
Investigating...
I think I figured it out. Fixing.
I think it's safe to bump the Android API level to 31 at this point: https://apilevels.com/
The following build works with no additional changes with the Android NDK r26b (previous LTS) and r27b releases.
```
cmake -D CMAKE_TOOLCHAIN_FILE="$NDK/build/cmake/android.toolchain.cmake" -D ANDROID_ABI="arm64-v8a" -D ANDROID_PLATFORM="android-31" -D CMAKE_C_FLAGS="-march=armv8.7a" -D CMAKE_CXX_FLAGS="-march=armv8.7a" -G Ninja -B build-android-arm64
...
cmake --build build-android-arm64
```
Those CFLAGS should be good for all Android ARM64 devices from 2023/24 and enable Q4_0_4_8 support, which is the most performant on current-gen CPUs.
NDK r26 and newer definitely includes OpenMP:
```
cmake-command-above
...
-- Found OpenMP_C: -fopenmp=libomp
-- Found OpenMP_CXX: -fopenmp=libomp
-- Found OpenMP: TRUE
-- OpenMP found
...
```
However, our threadpool implementation is more efficient at this point, so it makes sense to include `GGML_OPENMP=OFF`.
In other words, I don't think the `CMakeLists.txt` changes are needed; we should just update the README to recommend NDK r27b and API Level 31.
Hi, @max-krasnyansky --
I think whichever direction is chosen depends on what kind of (best-effort) support llama.cpp is intended to have for Android, either towards broader device support (with some kind of cut-off) or towards the most powerful and latest. I don't have a strong opinion about that.
As far as the CMakeLists.txt changes specifically, those have to do with linking subtleties re: Bionic; see https://developer.android.com/ndk/guides/stable_apis#c_library.
I'm happy to adjust the README to reflect any recommendations required to steer users of the project.
> I think it's safe to bump the Android API level to 31 at this point: https://apilevels.com/
66.5% coverage seems low for the kind of hardware that we usually support, e.g. we have builds for x86 processors without AVX, which was introduced in 2011. Older phones are perfectly capable of running small LLMs.
> I think it's safe to bump the Android API level to 31 at this point: https://apilevels.com/

> 66.5% coverage seems low for the kind of hardware that we usually support, eg. we have builds for x86 for processors without AVX, which was introduced in 2011. Older phones are perfectly capable of running small LLMs.
That data is a couple of years old, but fair point. We could go with API Level 28, which is sufficient to expose all the APIs we're using.
This builds/works just as well (tested with NDK r26b and r27b, on Galaxy S24).
```
cmake -D CMAKE_TOOLCHAIN_FILE="$NDK/build/cmake/android.toolchain.cmake" -D ANDROID_ABI="arm64-v8a" -D ANDROID_PLATFORM="android-28" -D CMAKE_C_FLAGS="-march=armv8.7a" -D CMAKE_CXX_FLAGS="-march=armv8.7a" -G Ninja -D GGML_OPENMP=OFF -B build-android-arm64
```
> Hi, @max-krasnyansky --
> I think whichever direction is chosen depends on what kind of (best-effort) support `llama.cpp` is intended to have for Android, either towards broader device support (with some kind of cut-off) or towards the most powerful and latest. I don't have a strong opinion about that.
We recently merged a PR for runtime detection of CPU capabilities, so it makes sense to enable all the latest CPU features at build time and let the CPU backend check what's available at runtime.
> As far as the `CMakeLists.txt` changes specifically, those have to do with linking subtleties re: Bionic; see https://developer.android.com/ndk/guides/stable_apis#c_library.
The CMake command I provided already links in everything we need. Here it is again:
```
cmake -D CMAKE_TOOLCHAIN_FILE="$NDK/build/cmake/android.toolchain.cmake" -D ANDROID_ABI="arm64-v8a" -D ANDROID_PLATFORM="android-28" -D CMAKE_C_FLAGS="-march=armv8.7a" -D CMAKE_CXX_FLAGS="-march=armv8.7a" -G Ninja -D GGML_OPENMP=OFF -B build-android-arm64
```
If you run a verbose build, you'll see that it's linking libm explicitly:
```
cmake --build build-android-arm64 --verbose
...
/home/maxk/src/android-ndk-r27b/toolchains/llvm/prebuilt/linux-x86_64/bin/clang++ --target=aarch64-none-linux-android28 --sysroot=/home/maxk/src/android-ndk-r27b/toolchains/llvm/prebuilt/linux-x86_64/sysroot -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -march=armv8.7a -O3 -DNDEBUG -static-libstdc++ -Wl,--build-id=sha1 -Wl,--no-rosegment -Wl,--no-undefined-version -Wl,--fatal-warnings -Wl,--no-undefined -Qunused-arguments -Wl,--gc-sections examples/server/CMakeFiles/llama-server.dir/server.cpp.o -o bin/llama-server common/libcommon.a -pthread src/libllama.so ggml/src/libggml.so -pthread -latomic -lm
```
libdl is not being linked explicitly, but the linker is happy (i.e. no link errors), so that symbol must be getting resolved, and the resulting binaries run on the device without errors.
Ah, OK. `GGML_STATIC=ON` is broken on Android without linking libdl (missing `dladdr`).
So the change for libm is not needed (it's already linked properly), and the libdl change looks good.
I am incorporating and considering the discussion thus far and will make edits.
> We recently merged PR for runtime detection of the CPU capabilities.
I saw that. If indeed the CPU features are detected at runtime, then I can see why we would include all the features during build (though I don't necessarily understand all the intricacies of -march).
> `-lm`
I appreciate seeing that output. Re-reading the Bionic and Android NDK documentation, I have the understanding that libm "is automatically linked by the build systems," in which case, explicitly linking m privately the way llama.cpp does right now is at least redundant. If we are truly averse to that logic in the CMake file, I will change it.
I'd say let's remove the libm change. No need for redundancy.
I'd also update the CMake command with:
- `ANDROID_PLATFORM=android-28`
- `CMAKE_C_FLAGS="-march=armv8.7a"`
- `CMAKE_CXX_FLAGS="-march=armv8.7a"`
(btw ideally we should just add cmake preset for this, not a blocker just a thought)
The rest looks good to me.
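For reference, a preset along those lines might look something like this sketch (assuming an `NDK` environment variable pointing at the NDK root; the preset name is illustrative and not part of this PR):

```json
{
  "version": 4,
  "configurePresets": [
    {
      "name": "android-arm64",
      "generator": "Ninja",
      "binaryDir": "${sourceDir}/build-android-arm64",
      "toolchainFile": "$env{NDK}/build/cmake/android.toolchain.cmake",
      "cacheVariables": {
        "ANDROID_ABI": "arm64-v8a",
        "ANDROID_PLATFORM": "android-28",
        "CMAKE_C_FLAGS": "-march=armv8.7a",
        "CMAKE_CXX_FLAGS": "-march=armv8.7a",
        "GGML_OPENMP": "OFF"
      }
    }
  ]
}
```

With a `CMakePresets.json` like this, the long command line collapses to `cmake --preset android-arm64` followed by `cmake --build build-android-arm64`.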
I will remove the libm change; though, to be clear, master currently links the lib explicitly, which is what I meant.
> btw ideally we should just add cmake preset for this
I agree. There is already CMake logic for Android -march, though when I started this PR, I didn't have a good idea about the direction to take with it. I will leave that out of this PR.
I have completed changes reflecting this discussion and have tested the builds myself using Android NDK r25c, r26b, and r27b.
One thing I will note is that, some time in the last week, the tokens/s performance (using llama-simple) in `adb shell` has dropped by about half. Very striking, and no change related to this PR seems to have made a difference either way (NDK, API level, `-march`).
> One thing I will note is that, some time in the last week, the tokens/s performance (using llama-simple) in adb shell has dropped by about half.
Can you pinpoint the commit that introduced the regression? What is the exact llama-simple command that you use?
I have checked out many commits (all before mine) and have yet to pinpoint it. It could be something else.
The exact command I have been using is `LD_LIBRARY_PATH=android/lib ./android/bin/llama-simple -m Q2_K-Meta-Llama-3.1-8B-Instruct.gguf -c 4096`.
FYI, I tried this on Termux (both w/ and w/o my commits) and I did not observe the same regression. Something weird is afoot. If I discover something, I will report.
Thanks. I think we are good to merge this. Agree?
As a heads-up, `armv8.7a` will not work with older devices, e.g. the Pixel 6 Pro (a 3-year-old device from 2021), even though these devices are running recent Android versions (Android 14).
> as a heads-up armv8.7a will not work with older devices i.e. Pixel 6 pro devices (3 year's old device, 2021), even though these devices are running recent Android versions (Android 14)
It should, because of the runtime detection of CPU capabilities (such as MATMUL_INT8, etc.). In theory, it's possible that the compiler will use one of those instructions elsewhere, but it's quite unlikely.
Can you try the latest on Pixel 6 Pro? Let us know if it doesn't work and we'll iterate further if needed.
Thanks, all. @dcale, thanks for your report. Please follow up with more info so we can continue to clarify any issues.
> as a heads-up armv8.7a will not work with older devices i.e. Pixel 6 pro devices (3 year's old device, 2021), even though these devices are running recent Android versions (Android 14)

> It should because of the runtime detection of the CPU capabilities (such as MATMUL_INT8, etc). In theory, it's possible that the compiler will use one of those instructions elsewhere but it's quite unlikely.
> Can you try the latest on Pixel 6 Pro? Let us know if it doesn't work and we'll iterate further if needed.
I'll do that and report back.
@max-krasnyansky @amqdn Running llama compiled with the `-march=armv8.7a` flag results in SIGILL (ILL_ILLOPC) on the Pixel 4a (2020). To be fair, the old target, `armv8.4a+dotprod`, wasn't much better. Only targeting `armv8.2-a`, which I believe is the highest compatible version for that model, makes it work.
@jsamol Thanks for the report. I will defer to the others about what to do here.
> Since we don't get much reports for `llama.cpp` on Android, I'll use the opportunity to ask if you (or anyone else) have tried to run the Vulkan backend on Android devices? Wondering if the Vulkan backend is already capable of utilizing the mobile GPU or if more work is needed there. Any feedback in that regard is appreciated.
I did try it. I built a host tool for shader generation and got the correct Vulkan HPP headers (they're missing in NDK >26 and incomplete in NDK ≤26). Compilation works fine, but performance is 2x worse than CPU-only. It also seems to use more RAM, even though the logs show the same amount with and without Vulkan.
Clearly some optimizations are needed here.