llama.cpp
android port of llama.cpp
@ggerganov, can we expect an Android port like the whisper.cpp one?
With CMake, it's quite easy to get an Android binary:
$ mkdir build
$ cd build
$ cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-30 \
-DCMAKE_C_FLAGS=-march=armv8.4a+dotprod ..
$ make
On recent flagship Android devices, run ./llama -m models/7B/ggml-model-q4_0.bin -t 4 -n 128 and you should get ~5 tokens/second.
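If you cross-compile on a host machine as above, the binary still has to get onto the device. A minimal sketch using adb, assuming the build lands the executable in the build directory and that /data/local/tmp is used as the writable, executable staging directory (paths are assumptions, not from the thread):

```shell
# Push the cross-compiled binary and a quantized model to the device,
# then run from /data/local/tmp (a directory where adb shell can exec).
adb push build/llama /data/local/tmp/
adb push models/7B/ggml-model-q4_0.bin /data/local/tmp/
adb shell "cd /data/local/tmp && chmod +x llama && \
  ./llama -m ggml-model-q4_0.bin -t 4 -n 128 -p 'The first man on the moon'"
```

Depending on the CMake generator, the binary may end up in build/bin/ instead of build/; adjust the push path accordingly.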
# ./llama -m models/7B/ggml-model-q4_0.bin -t 4 -n 128 -p "The first man on the moon"
main: seed = 1678784568
llama_model_load: loading model from 'models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from 'models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: prompt: 'The first man on the moon'
main: number of tokens in prompt = 7
1 -> ''
1576 -> 'The'
937 -> ' first'
767 -> ' man'
373 -> ' on'
278 -> ' the'
18786 -> ' moon'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
The first man on the moon Neil Armstrong dies at 82
Neil Amstrong, whose space mission made him an American hero and international icon for a generation of children who came to believe he was "the nicest man in history" (AP) [end of text]
main: mem per token = 14368644 bytes
main: load time = 3966.54 ms
main: sample time = 84.84 ms
main: predict time = 12131.24 ms / 220.57 ms per token
main: total time = 16974.94 ms
@freedomtan, I was talking about something with a simple UI, like interactive mode, where you can input the main prompt ("you are an assistant", etc.) and then start chatting.
See PR #130 on how to build and run with termux
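The Termux route builds directly on the device, so no cross-compilation or adb is needed. A rough sketch of what that looks like (package names and steps are assumptions; the PR is the authoritative reference):

```shell
# Inside the Termux app on the Android device:
pkg install clang git cmake make      # build toolchain (assumed package names)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make                                  # builds natively with the Termux clang
# copy a quantized model onto the device, then run as in the comments above:
# ./llama -m models/7B/ggml-model-q4_0.bin -t 4 -n 128
```

Building on-device avoids the NDK toolchain file entirely, at the cost of a slower compile.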
@GeorvityLabs that is a full blown app not a port at that point.
Yes. Sort of like the one for whisper.cpp
I'll try to write up something
With cmake, it's quite easy to get android binary
Unless of course attempting to do so just gives a nice output full of errors, and every time you fix one, another appears... any suggestions?
We made a Flutter app, if it helps :)
https://github.com/Bip-Rep/sherpa
Have fun
Please share the bin :)
@NoNamedCat I have releases on my git where you can find the apk. https://github.com/Bip-Rep/sherpa/releases
On recent flagship Android devices, run ./llama -m models/7B/ggml-model-q4_0.bin -t 4 -n 128 and you should get ~5 tokens/second.
@freedomtan Before this step, how can I install llama on an Android device? Is it as simple as copying a file named llama from somewhere else onto the Android device and then running the ./llama command?