Master issue: LLM Benchmark
What can we do for the default backend?
- Model: Let's start with llama 3.2 1B, preferably quantized ones
- Runtime:
- MediaPipe and AI-Edge-Torch: our current default backend is TFLite-based, so it would be good if we could continue using a TFLite-based solution.
- new API in LiteRT (https://ai.google.dev/edge/litert)
- ExecuTorch: https://github.com/pytorch/executorch, https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md
- llama.cpp: https://github.com/ggerganov/llama.cpp
- onnx runtime: https://github.com/microsoft/onnxruntime, let's check if it works on Android platforms.
- onnx runtime genai: https://github.com/microsoft/onnxruntime-genai
I've looked into building each runtime for Android and here's what I found out:
TFLite
Google's warning about AI-Edge-Torch being experimental is quite the understatement. After much struggling, I ended up using the nightly versions of ai-edge-torch and ai-edge-quantizer, which were released the same day as the last edit to the llama example. That was the only way to get the thing to actually function.
Once it started it used up all 11GB my GPU had and promptly failed. Trying to get it to use the CPU ended with a C++ protobuf error.
So I gave up and downloaded pre-converted models from HuggingFace, and attempted to use them with MediaPipe. The Android app compiled but ~would give a PERMISSION_DENIED error when attempting to load the model~.
Edit
I managed to get a model loaded by using the /data/ directory instead of /sdcard/; it seems Android has some protection for internal storage. It worked, but the model did not respond to my prompt properly, instead giving Python examples...
Executorch
Executorch seemed to have the most support and robust examples behind it. ~But unfortunately I couldn't even install the library's dependencies because of a cmake error regarding gflags. I want to look further into it to see if I can set it up locally since it's quite promising.~
Edit
I was able to get Executorch working. The CMake error was resolved by removing my local gflags package; it seems it was conflicting with Executorch's gflags in the venv.
Executorch has quite a few steps to get the .aar library compiled for Android, which caused me trouble when using an NDK version lower than 25. Their main guide for the llama example does mention that they support only version 27, but this isn't mentioned in the Android demo app guide. I was also able to compile the library using NDK version 28.
Getting the models wasn't very straightforward either, since they needed to be converted from .pth to .pte. Thankfully executorch's converter worked without issue (unlike AI-Edge-Torch).
Running the app required upgrading Gradle to 8.5, which thankfully Android Studio made quite straightforward; once that was done, the app compiled and ran without issue.
I attempted to use the unquantized model first. Upon pressing load in the Android app, my Galaxy S8 completely froze for 5 minutes until I restarted it, possibly due to RAM capacity. Loading the quantized model did not cause any issues, and the model worked flawlessly on the device.
This was done using the xnnpack backend, Executorch also supports Qualcomm and Mediatek AI engines which I did not test.
llama.cpp
Working with llama.cpp has been the most straightforward: the app compiled and launched without issue, a couple of line changes allowed me to download whichever custom GGUF model I wanted, and the models actually functioned (they were, however, extremely slow on my Galaxy S8). There was an issue where manually loading the models onto my phone caused them to not be readable, but this was fixed by having the phone download the models instead of my PC.
Conclusion
My priority was TFLite because it's what we use, but it seems that both MediaPipe and AI-Edge-Torch are still in their infancy and could cause problems when attempting to integrate them into the mobile app. It's a similar story with ExecuTorch but in a different direction: I like the documentation and support, though it is quite complicated to get running, especially if you stray from the guides. That leaves us with llama.cpp, models for which seem to be readily available. And while I've not attempted to convert my own (because HuggingFace has GGUF versions of the models), the models actually working without me having to dig through files upon files of code was a welcome change.
~I'll look further into installing Executorch for now, but until I can get it running,~ llama.cpp was the only runtime I could get to work without struggle. But ExecuTorch's converter and Android app (when they didn't crash the entire machine) were quite impressive. It's worth mentioning that while ExecuTorch is built around PyTorch and its models, llama.cpp supports quite a few LLMs other than llama.
For model conversion with AI-Edge-Torch, as I noted at https://github.com/google-ai-edge/ai-edge-torch/issues/269#issuecomment-2505304355
- you don't need GPU
- you need a lot of DRAM, which is supposedly a bug / feature of the current AI-Edge-Torch implementation (I don't remember the exact amount needed; supposedly 64 GiB is enough).
- I used Colab to do it.
I tried getting AI-Edge-Torch to work again, this time using a Docker container. The code seemed to run without issue until it was killed, seemingly due to lack of RAM (my machine only has 32GB). But with this setup a local CPU can be used to convert the models; it's just extremely finicky and prone to breaking.
@Mostelk: there is a GPT2 Android tflite app, https://github.com/huggingface/tflite-android-transformers/tree/master/gpt2
@mohitmundhragithub: there is https://github.com/mlcommons/mlperf_client which runs llama 2 7B
@freedomtan to check the GPT2 and the https://github.com/mlcommons/mlperf_client.
Check the input format of the huggingface tflite models (supposedly, you can simply cast integer token ids to fp16 when setting the input tensors, if the expected input is token ids rather than some embedding after feeding token ids into some gather ops).
@farook-edev it turns out it's a bit complicated. If you check the tflite models, you can find that their inputs and outputs are not simple. There are
- input token and input pos: int32, that is token id (from tokenizer) and token pos (0, 1, 2, ...)
- KV cache matrices: float32
How to inspect the models:
- visualization tools such as https://ai.google.dev/edge/model-explorer
- a couple of lines of code using the Python API:

```python
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='/tmp/foo.tflite')
print(interpreter.get_input_details())
```
I managed to get input into and output out of the converted model. I used the python version of TFLite but will be moving the code to C++ since the test on python was successful.
The process is as follows:
1. Copy a single token id into the token tensor and set the pos (starting at 0)
2. Run inference
3. Increment pos by 1 and copy the KV caches from output to input
4. Repeat until all input tokens are consumed
5. Get the output logits and convert to a token id
6. Feed the output token id into the input
7. Run inference
8. Repeat step 3
9. Repeat from step 5 until a limit is hit or 128001 (the end-of-text token) is found
One thing to note: if you take the list of input tensors and the list of output tensors and remove the non-KV-cache tensors (token, pos, and logits), the remaining tensors are ordered such that the output tensors map directly to the corresponding input tensors, e.g. output[0] -> input[0], output[1] -> input[1], output[2] -> input[2], output[3] -> input[3], ...
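The loop above can be sketched end-to-end in a few lines. The snippet below is a toy stand-in, not the real pipeline: `run_inference` replaces an actual `tf.lite.Interpreter` invoke, and its predict-the-next-integer behavior is purely illustrative.

```python
EOS_TOKEN = 128001   # llama 3 end-of-text id, per the steps above
MAX_NEW_TOKENS = 8

def run_inference(token_id, pos, kv_caches):
    """Stand-in for a single interpreter invoke(): a real model returns
    logits plus updated KV-cache output tensors, which step 3 copies back
    onto the matching input tensors (output[i] -> input[i])."""
    # Toy behavior: predict token_id + 1 until we emit the stop token.
    next_id = token_id + 1 if token_id + 1 < 10 else EOS_TOKEN
    return next_id, kv_caches  # returning the argmax directly for brevity

def generate(prompt_ids):
    kv_caches = {}  # illustrative; real caches are float32 tensors
    pos = 0
    # Steps 1-4: prefill, one prompt token per invocation.
    for tok in prompt_ids:
        next_tok, kv_caches = run_inference(tok, pos, kv_caches)
        pos += 1
    # Steps 5-9: greedy decode, feeding the model its own output.
    out = []
    tok = next_tok
    while tok != EOS_TOKEN and len(out) < MAX_NEW_TOKENS:
        out.append(tok)
        tok, kv_caches = run_inference(tok, pos, kv_caches)
        pos += 1
    return out

print(generate([1, 2, 3]))  # [4, 5, 6, 7, 8, 9] with the toy model
```

Swapping `run_inference` for real `set_tensor`/`invoke`/`get_tensor` calls (plus the cache copies) gives the C++ port the same structure.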
Aside from that, @freedomtan I was wondering if our benchmark should include tokenization and detokenization, or will the datasets be in the form of tokens?
Like what we did for Stable Diffusion 1.5, we can use tokens directly for the benchmark, which means we don't need the tokenization/de-tokenization part. We haven't discussed the dataset, or rather the inputs to be used for our benchmark, yet.
I've attempted to generate an int8 quantized model, but no matter how much I tried, the output was garbled and nothing like the fp16 one.
I should mention that I did not notice any difference in the shape, type, or sorting of the tensors, and I used the same steps as with the fp16 model. Please let me know if I'm missing something.
For now, I'll try to get a pipeline of some sort to run the non-quantized model using Flutter, unless there's something else that takes priority; if so, please let me know.
Quantization of small dense language models is not an easy task.
Gemma 3 could be another candidate https://blog.google/technology/developers/gemma-3/
Please
- test the 3b model
- check out exactly what quantization ai-edge-torch does
- use the "standard" tflite quantization tool to quantize the model
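For reference, the "standard" TFLite route is post-training dynamic-range quantization via `tf.lite.TFLiteConverter`. The sketch below runs it on a toy Keras model only; for llama 3.2 the input would be the graph produced by the ai-edge-torch converter instead, which may need a different entry point.

```python
import tensorflow as tf

# Toy stand-in model; the real case starts from the converted llama graph.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4),
])
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # weights -> int8
tflite_bytes = converter.convert()
with open("/tmp/toy_q8.tflite", "wb") as f:
    f.write(tflite_bytes)
```

Comparing the tensors of a model quantized this way against the ai-edge-torch output should show what its quantizer does differently.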
@freedomtan to check if he can make the llama 3 quantized 1b work, too.
@farook-edev I tested the quantized llama 3.2 1b tflite model just now (March 26th, 2025) on a Mac mini M4. It worked as expected.
- create a python venv environment for ai-edge-torch,
```shell
python3.12 -m venv ai-edge-torch
source ai-edge-torch/bin/activate
pip install ai-edge-torch
mkdir work
cd work
git clone git@github.com:google-ai-edge/ai-edge-torch.git
```
- download the 1b model from hugging face
- convert the model (by default, a weights-only quantized model will be at /tmp/llama_q8_ekv1280.tflite),
```shell
cd ai_edge_torch
python ai_edge_torch/generative/examples/llama/convert_to_tflite.py --checkpoint_path THE_MODEL_CHECKPOINT
```
- build the text_generator example,
```shell
bazel build -c opt //ai_edge_torch/generative/examples/cpp:text_generator_main
```
- prepare the llama 3.2 1b tokenizer,
```shell
python ai_edge_torch/generative/tools/tokenizer_to_sentencepiece.py --output_path=llama3_1b.spm.model --checkpoint=meta-llama/Llama-3.2-1B-Instruct
```
- use text_generator_main to run the quantized model,
```shell
bazel-bin/ai_edge_torch/generative/examples/cpp/text_generator_main \
  --tflite_model=/tmp/llama_q8_ekv1280.tflite \
  --sentencepiece_model=llama3_1b.spm.model \
  --start_token="<bos>" --stop_token="<eos>" \
  --num_threads=16 \
  --prompt="Tell me something about systolic array."
```
Check if we can run MMLU on Android. Most likely we'll need to use TinyMMLU, because even if we can run the full MMLU, it takes a lot of time.
@freedomtan to check how to run MMLU/TinyMMLU on Android.
From the client working group:
- performance metrics: time-to-first-token, tokens-per-second (excluding the first token; decoding).
- 4 categories
- context length: 4K (trying to increase to 8K; what about for the mobile working group?)
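The two metrics could be measured around a decode loop roughly as follows; `benchmark` and `model_step` are illustrative names, with `model_step` standing in for one inference call of whichever runtime we pick.

```python
import time

def benchmark(model_step, prompt_ids, max_new_tokens):
    """Measure time-to-first-token and decode tokens-per-second.
    model_step(token_id) -> next token id."""
    t0 = time.perf_counter()
    for tok in prompt_ids:            # prefill: consume the prompt
        next_tok = model_step(tok)
    ttft = time.perf_counter() - t0   # time-to-first-token

    t1 = time.perf_counter()
    generated = 1                     # the first token is already out
    tok = next_tok
    while generated < max_new_tokens:
        tok = model_step(tok)
        generated += 1
    # tokens-per-second excluding the first token (pure decoding),
    # per the client WG's definition
    tps = (generated - 1) / (time.perf_counter() - t1)
    return ttft, tps

ttft, tps = benchmark(lambda t: t + 1, [1, 2, 3], max_new_tokens=50)
print(ttft, tps)
```

The lambda is a dummy model; the numbers only become meaningful with a real backend behind `model_step`.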
@Mostelk and @mohitmundhragithub to check which benchmark is good for accuracy check of summarization tasks.
It turns out running MMLU or tinyMMLU on Android with instruct-tuned models is quite trivial. If we format the questions properly as input prompts, we can get the expected results. For example, with MediaPipe's LLM inference example for Android, use prompts such as:
Question: The number of days it takes to build a new house has a variance of 386. A sample of 40 new homes shows an average building time of 83 days. With what confidence can we assert that the average building time for a new house is between 80 and 90 days?
A. 15.4%
B. 17.8%
C. 20.0%
D. 82.1%
Answer:
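Producing that zero-shot prompt from a question and its four choices takes only a few lines; the function name below is just illustrative, and the layout follows the example above rather than any particular dataset schema.

```python
def format_mmlu_prompt(question, choices):
    """Format one MMLU-style item into the 'Question:/A./.../Answer:'
    zero-shot prompt shown above (choice labels A-D)."""
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", choices)]
    lines.append("Answer:")
    return "\n".join(lines)

print(format_mmlu_prompt("What is 2 + 2?", ["3", "4", "5", "22"]))
```

Scoring is then a matter of comparing the model's first generated token against the gold label (which, as noted below, works for llama 3.2 but not for every model family).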
BTW, tinyMMLU is part of tinyBenchmarks, which seems to be a good set of benchmarks for mobile devices.
Note that this is somewhat model dependent. With llama 3.2 instruct-tuned models, the prompt above works. For Gemma 3 models, the answer is not the first token/char; the generated tokens will start by explaining the reasons and then put the answer (A, B, C, or D) at the end.
@farook-edev Please check https://pytorch.org/executorch/stable/llm/llama-demo-android.html. Please list what is needed to add an ExecuTorch backend to our app; for whatever is needed, we can contact the right person to get help.
@anhappdev please help on check ExecuTorch https://github.com/pytorch/executorch, https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md
- out-of-memory, quantized 1B, w/ XNNPACK
- w/o XNNPACK: gibberish
- source: from ai-edge-torch
@freedomtan try to see if he can build the ai-edge-torch example for Android. And test the quantized 1B tflite model.
for the ai-edge-torch example, build with
```shell
bazel build -c opt \
  --config android_arm64 --cxxopt=-std=c++17 \
  //ai_edge_torch/generative/examples/cpp:text_generator_main
```
on the Mac mini, then push the binary and models to /data/local/tmp/ of an Android device, and run
```shell
./text_generator_main --tflite_model=llama-3.2-1b-q8.tflite \
  --sentencepiece_model=llama3_1b.spm.model \
  --start_token="<bos>" --stop_token="<eos>" \
  --num_threads=4 \
  --weight_cache_path=my_model.xnnpack_cache \
  --prompt="Tell me something about systolic array."
```
in /data/local/tmp/.
I got it running, and the outputs look reasonable to me.
@farook-edev
@freedomtan How did you manage to get the weight cache? I believe my device was running out of memory while attempting to generate it.
Nothing special. I tested it on a colleague's Samsung Galaxy S24+, which has 12 GiB of DRAM. I ran it with the command line I showed above, and got the cache.
@anhappdev please help @farook-edev to run test either on firebase or browserstack.
@freedomtan @farook-edev, to run the test on Firebase or BrowserStack, you can create a pull request for the changes; then the CI will run it. If you want to run it manually, it would take many steps, so I recommend using the PR.
Thanks @freedomtan @anhappdev, I managed to get the example to compile for the x86-64 emulator, I needed to add the following config to bazelrc:
```
build:android_x86_64 --config=android
build:android_x86_64 --cpu=x86_64
build:android_x86_64 --fat_apk_cpu=x86_64
build:android_x86_64 --cxxopt=-std=c++17
build:android_x86_64 --define=xnn_enable_avx=false
build:android_x86_64 --define=xnn_enable_avx2=false
build:android_x86_64 --define=xnn_enable_avx512=false
build:android_x86_64 --define=xnn_enable_avxvnni=false
build:android_x86_64 --define=xnn_enable_vnni=false
```
The defines were necessary because NDK uses a clang version that is too old and doesn't support AVX/VNNI.
I ran the executable on an emulator, and funnily enough, it used 4.4G of RAM, which was very slightly above what my physical device had.
The next step, I assume, is to build a pipeline for LLM based on this example. Could you please confirm, @freedomtan? Alternatively, I could help in testing the different datasets we discussed.
> I ran the executable on an emulator, and funny enough, it used 4.4G of RAM, which was very slightly above what my physical device had.
Does the binary you build run well (getting expected results at least)?
> The next step I assume is to build a pipeline for LLM based on this example. Could you please confirm?
Yes, please. If the x86_64 binary works well, please add the following:
- Time-to-first-token (ttft) and decode speed measurement.
- Hooks for backends to provide C callback functions.
- A new dataset and task.
Please refer to our simple documentation for guidance. While adding these features, please consider enhancing the documentation.