Master issue: LLM Benchmark
What can we do for the default backend?
- Model: Let's start with llama 3.2 1B, preferably quantized ones
- Runtime:
- MediaPipe and AI-Edge-Torch: our current default backend is TFLite-based, so it would be good if we could continue using a TFLite-based solution.
- new API in LiteRT (https://ai.google.dev/edge/litert)
- ExecuTorch: https://github.com/pytorch/executorch, https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md
- llama.cpp: https://github.com/ggerganov/llama.cpp
- onnx runtime: https://github.com/microsoft/onnxruntime, let's check if it works on Android platforms.
- onnx runtime genai: https://github.com/microsoft/onnxruntime-genai
I've looked into building each runtime for Android and here's what I found out:
TFLite
Google's warning about AI-Edge-Torch being experimental is quite the understatement. After much struggling, I ended up using the nightly versions of ai-edge-torch and ai-edge-quantizer, which were released the same day as the last edit to the llama example. That was the only way to get the thing to actually function.
Once it started it used up all 11GB my GPU had and promptly failed. Trying to get it to use the CPU ended with a C++ protobuf error.
So I gave up and downloaded pre-converted models from HuggingFace, and attempted to use them with MediaPipe. The Android app compiled but ~would give a PERMISSION_DENIED error when attempting to load the model~.
Edit
I managed to get a model loaded by using the /data/ directory instead of /sdcard/; it seems Android has some protection for internal storage. It worked, but the model did not respond to my prompt properly, instead giving Python examples...
Executorch
Executorch seemed to have the most support and robust examples behind it. ~But unfortunately I couldn't even install the library's dependencies because of a cmake error regarding gflags. I want to look further into it to see if I can set it up locally since it's quite promising.~
Edit
I was able to get Executorch working. The CMake error was resolved by removing my local gflags package; it seems it was conflicting with Executorch's gflags in the venv.
Executorch has quite a few steps to get the .aar library compiled for Android, which caused me trouble when using an NDK version lower than 25. Their main guide for the llama example does mention that they support only version 27, but this isn't mentioned in the Android demo app guide. I was also able to compile the library using NDK version 28.
Getting the models wasn't very straightforward either, since they needed to be converted from .pth to .pte. Thankfully executorch's converter worked without issue (unlike AI-Edge-Torch).
Running the app required upgrading Gradle to 8.5, which thankfully Android Studio made quite straightforward; once that was done, the app compiled and ran without issue.
I attempted to use the unquantized model first. Upon pressing load in the Android app, my Galaxy S8 completely froze for 5 minutes until I restarted it, possibly due to RAM capacity. Loading the quantized model did not cause any issues, and the model worked flawlessly on the device.
This was done using the xnnpack backend, Executorch also supports Qualcomm and Mediatek AI engines which I did not test.
llama.cpp
Working with llama.cpp has been the most straightforward: the app compiled and launched without issue, a couple of line changes allowed me to download whichever custom GGUF model I wanted, and the models actually functioned (they were, however, extremely slow on my Galaxy S8). There was an issue where manually loading the models onto my phone caused them to not be readable, but this was fixed by having the phone download the models instead of my PC.
Conclusion
My priority was TFLite because it's what we use, but it seems that both MediaPipe and AI-Edge-Torch are still in their infancy and could cause problems when attempting to integrate them into the mobile app. It's a similar story with ExecuTorch but in a different direction: I like the documentation and support, though it is quite complicated to get running, especially if you stray from the guides. That leaves us with llama.cpp, models for which seem to be readily available. And while I've not attempted to convert my own (because HuggingFace has GGUF versions of the models), the models actually working without me having to dig through files upon files of code was a welcome change.
~I'll look further into installing Executorch for now, but until I can get it running,~ llama.cpp was the only runtime I could get to work without struggle. But ExecuTorch's converter and Android app (when they didn't crash the entire machine) were quite impressive. It's worth mentioning that while ExecuTorch is built around PyTorch and its models, llama.cpp supports quite a few LLMs other than llama.
For model conversion with AI-Edge-Torch, as I noted at https://github.com/google-ai-edge/ai-edge-torch/issues/269#issuecomment-2505304355
- you don't need GPU
- you need a lot of DRAM, which is supposedly a bug / feature of the current AI-Edge-Torch implementation (I don't remember the exact amount needed; supposedly 64 GiB is enough).
- I used Colab to do it.
I tried getting AI-Edge-Torch to work again, this time using a Docker container. The code seemed to run without issue until it was killed, seemingly due to lack of RAM (my machine only has 32GB). But with this setup a local CPU can be used to convert the models; it's just extremely finicky and prone to breaking.
@Mostelk: there is a GPT2 Android tflite app, https://github.com/huggingface/tflite-android-transformers/tree/master/gpt2
@mohitmundhragithub: there is https://github.com/mlcommons/mlperf_client which runs llama 2 7B
@freedomtan to check the GPT2 and the https://github.com/mlcommons/mlperf_client.
Check the input format of the huggingface tflite models (supposedly, you can simply cast integer token ids to fp16 when setting the input tensors, if the expected input is token ids rather than some embedding after feeding token ids into some gather ops).
@farook-edev it turns out it's a bit complicated. If you check the tflite models, you can find that their inputs and outputs are not simple. There are
- input token and input pos: int32, that is token id (from tokenizer) and token pos (0, 1, 2, ...)
- KV cache matrices: float32
How to inspect the models:
- visualization tools such as https://ai.google.dev/edge/model-explorer
- a couple of lines of code using the Python API:

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='/tmp/foo.tflite')
print(interpreter.get_input_details())
```
I managed to get input into and output out of the converted model. I used the python version of TFLite but will be moving the code to C++ since the test on python was successful.
The process is as follows:
1. Copy a single token id into the token tensor and set the pos (starting at 0)
2. Run inference
3. Increment pos by 1 and copy the KV caches from output to input
4. Repeat until all input tokens are consumed
5. Get the output logits and convert to a token id
6. Feed the output token id into the input
7. Run inference
8. Repeat step 3
9. Repeat from step 5 until a limit is hit or 128001 (the end-of-text token) is found
One thing to note: if you take the list of input tensors and the list of output tensors and remove the non-KV-cache tensors (token, pos, and logits), the remaining tensors are ordered such that the output tensors map directly to the corresponding input tensors, e.g. output[0] -> input[0], output[1] -> input[1], output[2] -> input[2], output[3] -> input[3], ...
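The loop above can be sketched end-to-end in a few lines. The snippet below is a toy stand-in, not the real pipeline: `run_inference` replaces an actual `tf.lite.Interpreter` invoke, and its predict-the-next-integer behavior is purely illustrative.

```python
EOS_TOKEN = 128001   # llama 3 end-of-text id, per the steps above
MAX_NEW_TOKENS = 8

def run_inference(token_id, pos, kv_caches):
    """Stand-in for a single interpreter invoke(): a real model returns
    logits plus updated KV-cache output tensors, which step 3 copies back
    onto the matching input tensors (output[i] -> input[i])."""
    # Toy behavior: predict token_id + 1 until we emit the stop token.
    next_id = token_id + 1 if token_id + 1 < 10 else EOS_TOKEN
    return next_id, kv_caches  # returning the argmax directly for brevity

def generate(prompt_ids):
    kv_caches = {}  # illustrative; real caches are float32 tensors
    pos = 0
    # Steps 1-4: prefill, one prompt token per invocation.
    for tok in prompt_ids:
        next_tok, kv_caches = run_inference(tok, pos, kv_caches)
        pos += 1
    # Steps 5-9: greedy decode, feeding the model its own output.
    out = []
    tok = next_tok
    while tok != EOS_TOKEN and len(out) < MAX_NEW_TOKENS:
        out.append(tok)
        tok, kv_caches = run_inference(tok, pos, kv_caches)
        pos += 1
    return out

print(generate([1, 2, 3]))  # [4, 5, 6, 7, 8, 9] with the toy model
```

Swapping `run_inference` for real `set_tensor`/`invoke`/`get_tensor` calls (plus the cache copies) gives the C++ port the same structure.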
Aside from that, @freedomtan I was wondering if our benchmark should include tokenization and detokenization, or will the datasets be in the form of tokens?
Like what we did for Stable Diffusion 1.5, we can use tokens directly for the benchmark, which means we don't need the tokenization/de-tokenization part. We haven't discussed the dataset, or rather the inputs to be used for our benchmark, yet.
I've attempted to generate an int8 quantized model, but no matter how much I tried, the output was garbled and nothing like the fp16 one.
I should mention that I did not notice any difference in the shape, type, or sorting of the tensors, and I used the same steps as with the fp16 model. Please let me know if I'm missing something.
For now, I'll try to get a pipeline of some sort to run the non-quantized model using Flutter, unless there's something else that takes priority; if so, please let me know.
Quantization of small dense language models is not an easy task.
Gemma 3 could be another candidate https://blog.google/technology/developers/gemma-3/
Please
- test the 3b model
- check out exactly what quantization ai-edge-torch does
- use the "standard" tflite quantization tool to quantize the model
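For reference, the "standard" TFLite route is post-training dynamic-range quantization via `tf.lite.TFLiteConverter`. The sketch below runs it on a toy Keras model only; for llama 3.2 the input would be the graph produced by the ai-edge-torch converter instead, which may need a different entry point.

```python
import tensorflow as tf

# Toy stand-in model; the real case starts from the converted llama graph.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4),
])
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # weights -> int8
tflite_bytes = converter.convert()
with open("/tmp/toy_q8.tflite", "wb") as f:
    f.write(tflite_bytes)
```

Comparing the tensors of a model quantized this way against the ai-edge-torch output should show what its quantizer does differently.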
@freedomtan to check if he can make the llama 3 quantized 1b work, too.
@farook-edev I tested the quantized llama 3.2 1b tflite model just now (March 26th, 2025) on a Mac mini M4. It worked as expected.
- create a python venv environment for ai-edge-torch,
```shell
python3.12 -m venv ai-edge-torch
source ai-edge-torch/bin/activate
pip install ai-edge-torch
mkdir work
cd work
git clone git@github.com:google-ai-edge/ai-edge-torch.git
```
- download the 1b model from hugging face
- convert the model (by default, a weights-only quantized model will be at /tmp/llama_q8_ekv1280.tflite),
```shell
cd ai_edge_torch
python ai_edge_torch/generative/examples/llama/convert_to_tflite.py --checkpoint_path THE_MODEL_CHECKPOINT
```
- build the text_generator example,
```shell
bazel build -c opt //ai_edge_torch/generative/examples/cpp:text_generator_main
```
- prepare the llama 3.2 1b tokenizer,
```shell
python ai_edge_torch/generative/tools/tokenizer_to_sentencepiece.py --output_path=llama3_1b.spm.model --checkpoint=meta-llama/Llama-3.2-1B-Instruct
```
- use text_generator_main to run the quantized model,
```shell
bazel-bin/ai_edge_torch/generative/examples/cpp/text_generator_main \
  --tflite_model=/tmp/llama_q8_ekv1280.tflite \
  --sentencepiece_model=llama3_1b.spm.model \
  --start_token="<bos>" --stop_token="<eos>" \
  --num_threads=16 \
  --prompt="Tell me something about systolic array."
```
Check if we can run MMLU on Android. Most likely we'll need to use TinyMMLU, because even if we can run the full MMLU, it takes a lot of time.
@freedomtan to check how to run MMLU/TinyMMLU on Android.
From the client working group:
- performance metrics: time-to-first-token, tokens-per-second (excluding the first token; decoding).
- 4 categories
- context length: 4K (trying to increase to 8K; what about for the mobile working group?)
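The two metrics could be measured around a decode loop roughly as follows; `benchmark` and `model_step` are illustrative names, with `model_step` standing in for one inference call of whichever runtime we pick.

```python
import time

def benchmark(model_step, prompt_ids, max_new_tokens):
    """Measure time-to-first-token and decode tokens-per-second.
    model_step(token_id) -> next token id."""
    t0 = time.perf_counter()
    for tok in prompt_ids:            # prefill: consume the prompt
        next_tok = model_step(tok)
    ttft = time.perf_counter() - t0   # time-to-first-token

    t1 = time.perf_counter()
    generated = 1                     # the first token is already out
    tok = next_tok
    while generated < max_new_tokens:
        tok = model_step(tok)
        generated += 1
    # tokens-per-second excluding the first token (pure decoding),
    # per the client WG's definition
    tps = (generated - 1) / (time.perf_counter() - t1)
    return ttft, tps

ttft, tps = benchmark(lambda t: t + 1, [1, 2, 3], max_new_tokens=50)
print(ttft, tps)
```

The lambda is a dummy model; the numbers only become meaningful with a real backend behind `model_step`.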
@Mostelk and @mohitmundhragithub to check which benchmark is good for accuracy check of summarization tasks.
It turns out running MMLU or tinyMMLU on Android with instruct-tuned models is quite trivial. If we format the questions properly as input prompts, we can get the expected results. For example, with MediaPipe's LLM inference example for Android, use prompts such as:
Question: The number of days it takes to build a new house has a variance of 386. A sample of 40 new homes shows an average building time of 83 days. With what confidence can we assert that the average building time for a new house is between 80 and 90 days?
A. 15.4%
B. 17.8%
C. 20.0%
D. 82.1%
Answer:
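Producing that zero-shot prompt from a question and its four choices takes only a few lines; the function name below is just illustrative, and the layout follows the example above rather than any particular dataset schema.

```python
def format_mmlu_prompt(question, choices):
    """Format one MMLU-style item into the 'Question:/A./.../Answer:'
    zero-shot prompt shown above (choice labels A-D)."""
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", choices)]
    lines.append("Answer:")
    return "\n".join(lines)

print(format_mmlu_prompt("What is 2 + 2?", ["3", "4", "5", "22"]))
```

Scoring is then a matter of comparing the model's first generated token against the gold label (which, as noted below, works for llama 3.2 but not for every model family).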
BTW, tinyMMLU is part of tinyBenchmarks, which seems to be a good set of benchmarks for mobile devices.
Note that this is somewhat model dependent. With llama 3.2 instruct-tuned models, the prompt above works. For Gemma 3 models, the answer is not the first token/char; the generated tokens will start by explaining the reasons and then put the answer (A, B, C, or D) at the end.
@farook-edev Please check https://pytorch.org/executorch/stable/llm/llama-demo-android.html. Please list what is needed to add an ExecuTorch backend to our app; for whatever is needed, we can contact the right person to get help.
@anhappdev please help on check ExecuTorch https://github.com/pytorch/executorch, https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md
- out-of-memory, quantized 1B, w/ XNNPACK
- w/o XNNPACK: gibberish
- source: from ai-edge-torch
@freedomtan try to see if he can build the ai-edge-torch example for Android. And test the quantized 1B tflite model.
for the ai-edge-torch example, build with
```shell
bazel build -c opt \
  --config android_arm64 --cxxopt=-std=c++17 \
  //ai_edge_torch/generative/examples/cpp:text_generator_main
```
on the Mac mini, then push the binary and models to /data/local/tmp/ of an Android device, and run
```shell
./text_generator_main --tflite_model=llama-3.2-1b-q8.tflite \
  --sentencepiece_model=llama3_1b.spm.model \
  --start_token="<bos>" --stop_token="<eos>" \
  --num_threads=4 \
  --weight_cache_path=my_model.xnnpack_cache \
  --prompt="Tell me something about systolic array."
```
in /data/local/tmp/.
I got it running, and the outputs look reasonable to me.
@farook-edev
@freedomtan How did you manage to get the weight cache? I believe my device was running out of memory while attempting to generate it.
Nothing special. I tested it on a colleague's Samsung Galaxy S24+, which has 12 GiB of DRAM. I ran it with the command line I showed above, and got the cache.
@anhappdev please help @farook-edev to run test either on firebase or browserstack.
@freedomtan @farook-edev, to run the test on Firebase or BrowserStack, you can create a pull request for the changes; then the CI will run it. If you want to run it manually, it would take many steps, so I recommend using the PR.
Thanks @freedomtan @anhappdev, I managed to get the example to compile for the x86-64 emulator, I needed to add the following config to bazelrc:
```
build:android_x86_64 --config=android
build:android_x86_64 --cpu=x86_64
build:android_x86_64 --fat_apk_cpu=x86_64
build:android_x86_64 --cxxopt=-std=c++17
build:android_x86_64 --define=xnn_enable_avx=false
build:android_x86_64 --define=xnn_enable_avx2=false
build:android_x86_64 --define=xnn_enable_avx512=false
build:android_x86_64 --define=xnn_enable_avxvnni=false
build:android_x86_64 --define=xnn_enable_vnni=false
```
The defines were necessary because NDK uses a clang version that is too old and doesn't support AVX/VNNI.
I ran the executable on an emulator, and funnily enough, it used 4.4G of RAM, which was very slightly above what my physical device had.
The next step, I assume, is to build a pipeline for LLM based on this example. Could you please confirm, @freedomtan? Alternatively, I could help in testing the different datasets we discussed.
> I ran the executable on an emulator, and funny enough, it used 4.4G of RAM, which was very slightly above what my physical device had.
Does the binary you build run well (getting expected results at least)?
> The next step I assume is to build a pipeline for LLM based on this example. Could you please confirm?
Yes, please. If the x86_64 binary works well, please add the following:
- Time-to-first-token (ttft) and decode speed measurement.
- Hooks for backends to provide C callback functions.
- A new dataset and task.
Please refer to our simple documentation for guidance. While adding these features, please consider enhancing the documentation.