
Tasks for v5.1

Open Mostelk opened this issue 5 months ago • 3 comments

  • LLM Benchmark for Android (4 weeks)

Dataset
-->> needs to be adjusted to be a tokenized version of TinyMMLU
-->> IFEval dataset needs to be added

Performance Mode:
-->> LoadGen Query interface needs more work to query backends without interrupting them; callbacks need to be implemented
-->> Interface to query logits for decoding from backends (pending)
-->> logging of output (token ids) during performance mode to allow accuracy audit

Accuracy Mode: not started
-->> DeTokenizer using SentencePiece
-->> We need to try to implement IFEval and TinyMMLU evaluation on device: TinyMMLU is mandatory on device, and IFEval should at least dump results to JSON

Model: we only have a toy 1B model (Freedom already provided TinyMMLU accuracy on device for 1B and 3B Llama 3.1, dynamically quantized with AI Edge Torch)
-->> we need either 3B or 8B, as the group decides

Other tasks that need resources

  • LLM Benchmark for iOS (4 weeks)

Not Started (do same as in Android & more)

  • Continue Sergji effort to optimize models for CoreML (6 weeks)
  • 5 models to optimize for iOS (we only have MobileNetv4 optimized)

  • CoreML iOS experience is needed to optimize legacy as well as GenAI (SD, LLM) models

  • Update TFLite/LiteRT backend (6 weeks) ->> Performance degraded on Pixel 10 ->> Debug/work with Google to fix it

Mostelk avatar Sep 24 '25 22:09 Mostelk

@Mostelk I've cleaned up and formatted the description a bit. Please let me know if I missed or misplaced anything.

LLM Benchmark for Android (4 weeks)

Dataset

  • [x] needs to be adjusted to be a tokenized version of TinyMMLU
  • [x] IFEval dataset needs to be added

Performance Mode:

  • [x] LoadGen Query interface needs more work to query backends without interrupting them; callbacks need to be implemented
  • [ ] Interface to query logits for decoding from backends (pending)
  • [ ] logging of output (token ids) during performance mode to allow accuracy audit
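
To make the token-id logging item concrete, here is a minimal sketch of what an audit log could look like. It is only an illustration under assumptions: the record layout (`query_id`, `token_ids`), the function names, and the exact-match comparison are hypothetical and are not taken from LoadGen or from the app's current code.

```python
import json


def log_token_ids(log, query_id, token_ids):
    """Append the output token ids of one query to an in-memory log.

    The record layout is a hypothetical example, not an existing format.
    """
    log.append({"query_id": query_id, "token_ids": list(token_ids)})


def audit(perf_log, reference):
    """Return the query ids whose logged token ids differ from a reference run."""
    ref = {r["query_id"]: r["token_ids"] for r in reference}
    return [e["query_id"] for e in perf_log
            if ref.get(e["query_id"]) != e["token_ids"]]


# Token ids logged during a (made-up) performance-mode run.
perf_log = []
log_token_ids(perf_log, 0, [1, 15, 42])
log_token_ids(perf_log, 1, [1, 7, 9])

# Token ids from a (made-up) reference run, e.g. accuracy mode.
reference = [
    {"query_id": 0, "token_ids": [1, 15, 42]},
    {"query_id": 1, "token_ids": [1, 7, 8]},
]

mismatches = audit(perf_log, reference)
serialized = json.dumps(perf_log)  # what would be written to the log file
```

Exact token equality is the simplest possible check; an actual audit might instead detokenize the logged ids and score them with the accuracy-mode metric.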

Misc

  • [x] Accuracy Mode not started
  • [x] DeTokenizer using SentencePiece
  • [ ] We need to try to implement IFEval and TinyMMLU evaluation on device: TinyMMLU is mandatory on device, and IFEval should at least dump results to JSON
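
As a rough sketch of the two evaluation paths, assuming TinyMMLU is scored as plain multiple-choice accuracy and that the IFEval dump is a JSON list of prompt/response records: the function names and the `prompt`/`response` field names below are hypothetical and would need to match whatever the offline IFEval scorer expects.

```python
import json
import os
import tempfile


def tinymmlu_accuracy(predictions, answers):
    """Fraction of multiple-choice predictions that match the gold answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def dump_ifeval_responses(records, path):
    """Write generated responses to a JSON file for off-device scoring."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


# Made-up predictions and gold answers for four questions.
accuracy = tinymmlu_accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"])

# Made-up IFEval record; the field names are an assumption.
records = [{"prompt": "Answer in exactly three words.",
            "response": "Here you go."}]
out_path = os.path.join(tempfile.mkdtemp(), "ifeval_responses.json")
dump_ifeval_responses(records, out_path)
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

Keeping the IFEval step to a plain dump means the instruction-following checks can run off device, which matches the "at least dump result to json" requirement.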

Model

We only have a toy 1B model (Freedom already provided TinyMMLU accuracy on device for 1B and 3B Llama 3.1, dynamically quantized with AI Edge Torch).

  • [ ] we need either 3B or 8B, as the group decides

LLM Benchmark for iOS (4 weeks)

Not Started (do same as in Android & more)

Continue Sergji effort to optimize models for CoreML (6 weeks)

  • [ ] 5 models to optimize for iOS (we only have MobileNetv4 optimized)
  • [ ] CoreML iOS experience is needed to optimize legacy as well as GenAI (SD, LLM) models

Update TFLite/LiteRT backend (6 weeks)

  • [ ] Performance degraded on Pixel 10. Debug/work with Google to fix it

farook-edev avatar Sep 25 '25 08:09 farook-edev

  • For the Android one, let's try to have the app running an LLM (3B or 8B) on Pixel phones.

freedomtan avatar Sep 30 '25 05:09 freedomtan

@farook-edev for the Llama 3.2 3B and Llama 3.1 8B models, we need to know whether we can bring them into our implementation (dynamically quantized TFLite) so that we can decide on the benchmark model.

Mostelk avatar Oct 01 '25 22:10 Mostelk