Tasks for v5.1
@Mostelk I've cleaned up and formatted the description a bit. Please let me know if I missed or misplaced anything.
LLM Benchmark for Android (4 weeks)
Dataset
- [x] Adjust the dataset to be a tokenized version of TinyMMLU
- [x] Add the IFEval dataset
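As a sketch of the dataset task above — assuming TinyMMLU arrives as JSONL records with `question`, `choices`, and `answer` fields (the field names and the toy tokenizer are assumptions; the real pipeline would use the model's SentencePiece tokenizer):

```python
import json

def toy_tokenize(text):
    # Placeholder tokenizer: maps each whitespace word to a fake integer id.
    # A real run would call the model's SentencePiece tokenizer instead.
    return [abs(hash(w)) % 32000 for w in text.split()]

def tokenize_tinymmlu(jsonl_lines):
    """Convert raw TinyMMLU-style JSONL records into tokenized samples."""
    samples = []
    for line in jsonl_lines:
        rec = json.loads(line)
        prompt = rec["question"] + "\n" + "\n".join(rec["choices"])
        samples.append({"input_ids": toy_tokenize(prompt),
                        "answer": rec["answer"]})
    return samples

raw = ['{"question": "2+2?", "choices": ["A. 3", "B. 4"], "answer": "B"}']
print(tokenize_tinymmlu(raw)[0]["answer"])  # B
```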
Performance Mode:
- [x] Rework the LoadGen query interface so it can query backends without interrupting them; implement the callbacks
- [ ] Add an interface to query logits for decoding from backends (pending)
- [ ] Log output token IDs during performance mode to allow an accuracy audit
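One way the query interface, callbacks, and token-id logging could fit together — a minimal sketch, not the actual LoadGen or backend API: the backend processes queries on its own worker thread so the issuing side is never blocked, a callback fires when decoding finishes, and the output token IDs are appended to a log for a later accuracy audit.

```python
import queue
import threading

class AsyncBackend:
    """Toy backend: queries run on a worker thread and results are
    delivered via callback, so the issuing thread never blocks."""
    def __init__(self):
        self._q = queue.Queue()
        self.token_log = []  # token ids recorded during performance mode
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            prompt_ids, callback = self._q.get()
            if prompt_ids is None:            # shutdown sentinel
                break
            output_ids = [t + 1 for t in prompt_ids]  # fake "decoding"
            self.token_log.append(output_ids)         # audit trail
            callback(output_ids)

    def issue_query(self, prompt_ids, callback):
        self._q.put((prompt_ids, callback))   # returns immediately

    def shutdown(self):
        self._q.put((None, None))

done = threading.Event()
backend = AsyncBackend()
backend.issue_query([1, 2, 3], lambda ids: done.set())
done.wait(timeout=5)
backend.shutdown()
print(backend.token_log)  # [[2, 3, 4]]
```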
Misc
- [x] Accuracy Mode not started
- [x] Detokenizer using SentencePiece
- [ ] Try to implement IFEval and TinyMMLU evaluation on device: TinyMMLU is mandatory on device, while IFEval should at least dump its results to JSON
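A sketch of the detokenizer and the IFEval JSON dump, with a toy id-to-piece table standing in for a real SentencePiece model (SentencePiece marks word starts with "▁", which is what the `replace` below mimics; the output schema is an assumption):

```python
import json

# Toy id -> piece table; a real run would load a SentencePiece model instead.
VOCAB = {1: "▁Hello", 2: "▁wor", 3: "ld", 4: "!"}

def detokenize(ids):
    """Join pieces and turn SentencePiece's '▁' word marker into spaces."""
    return "".join(VOCAB[i] for i in ids).replace("▁", " ").strip()

def dump_ifeval_results(records, path):
    """Write {prompt, response} pairs as JSON for off-device IFEval scoring."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"prompt": p, "response": detokenize(ids)}
                   for p, ids in records], f, ensure_ascii=False, indent=2)

print(detokenize([1, 2, 3, 4]))  # Hello world!
```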
Model
We only have the toy 1B model (Freedom has already provided on-device TinyMMLU accuracy for 1B and 3B Llama 3.1, dynamically quantized with AI Edge Torch).
- [ ] We need either the 3B or the 8B model, as the group decides
LLM Benchmark for iOS (4 weeks)
Not started (do the same as on Android, and more)
Continue Sergji's effort to optimize models for CoreML (6 weeks)
- [ ] Optimize 5 models for iOS (we only have MobileNetv4 optimized so far)
- [ ] CoreML/iOS experience is needed to optimize legacy models as well as GenAI (SD, LLM)
Update TFLite/LiteRT backend (6 weeks)
- [ ] Performance degraded on Pixel 10; debug and work with Google to fix it
- For the Android benchmark, let's try to have a running app on Pixel phones with an LLM (3B or 8B).

@farook-edev For the Llama 3.2 3B and Llama 3.1 8B models, we need to know whether we can bring them into our implementation (dynamically quantized TFLite) to decide on the benchmark model.