Tasks for v5.1
@Mostelk I've cleaned up and formatted the description a bit. Please let me know if I missed or misplaced anything.
LLM Benchmark for Android (4 weeks)
Dataset
- [x] Adjust the dataset to be a tokenized version of TinyMMLU
- [x] Add the IFEval dataset
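As a sketch of the dataset task above — assuming TinyMMLU arrives as JSONL records with `question`, `choices`, and `answer` fields (the field names and the toy tokenizer are assumptions; the real pipeline would use the model's SentencePiece tokenizer):

```python
import json

def toy_tokenize(text):
    # Placeholder tokenizer: maps each whitespace word to a fake integer id.
    # A real run would call the model's SentencePiece tokenizer instead.
    return [abs(hash(w)) % 32000 for w in text.split()]

def tokenize_tinymmlu(jsonl_lines):
    """Convert raw TinyMMLU-style JSONL records into tokenized samples."""
    samples = []
    for line in jsonl_lines:
        rec = json.loads(line)
        prompt = rec["question"] + "\n" + "\n".join(rec["choices"])
        samples.append({"input_ids": toy_tokenize(prompt),
                        "answer": rec["answer"]})
    return samples

raw = ['{"question": "2+2?", "choices": ["A. 3", "B. 4"], "answer": "B"}']
print(tokenize_tinymmlu(raw)[0]["answer"])  # B
```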
Performance Mode:
- [x] Rework the LoadGen query interface so it can query backends without interrupting them; implement the callbacks
- [ ] Add an interface to query logits for decoding from backends (pending)
- [ ] Log output token IDs during performance mode to allow an accuracy audit
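One way the query interface, callbacks, and token-id logging could fit together — a minimal sketch, not the actual LoadGen or backend API: the backend processes queries on its own worker thread so the issuing side is never blocked, a callback fires when decoding finishes, and the output token IDs are appended to a log for a later accuracy audit.

```python
import queue
import threading

class AsyncBackend:
    """Toy backend: queries run on a worker thread and results are
    delivered via callback, so the issuing thread never blocks."""
    def __init__(self):
        self._q = queue.Queue()
        self.token_log = []  # token ids recorded during performance mode
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            prompt_ids, callback = self._q.get()
            if prompt_ids is None:            # shutdown sentinel
                break
            output_ids = [t + 1 for t in prompt_ids]  # fake "decoding"
            self.token_log.append(output_ids)         # audit trail
            callback(output_ids)

    def issue_query(self, prompt_ids, callback):
        self._q.put((prompt_ids, callback))   # returns immediately

    def shutdown(self):
        self._q.put((None, None))

done = threading.Event()
backend = AsyncBackend()
backend.issue_query([1, 2, 3], lambda ids: done.set())
done.wait(timeout=5)
backend.shutdown()
print(backend.token_log)  # [[2, 3, 4]]
```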
Misc
- [x] Accuracy Mode not started
- [x] Detokenizer using SentencePiece
- [ ] Try to implement IFEval and TinyMMLU evaluation on device: TinyMMLU is mandatory on device, while IFEval should at least dump its results to JSON
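A sketch of the detokenizer and the IFEval JSON dump, with a toy id-to-piece table standing in for a real SentencePiece model (SentencePiece marks word starts with "▁", which is what the `replace` below mimics; the output schema is an assumption):

```python
import json

# Toy id -> piece table; a real run would load a SentencePiece model instead.
VOCAB = {1: "▁Hello", 2: "▁wor", 3: "ld", 4: "!"}

def detokenize(ids):
    """Join pieces and turn SentencePiece's '▁' word marker into spaces."""
    return "".join(VOCAB[i] for i in ids).replace("▁", " ").strip()

def dump_ifeval_results(records, path):
    """Write {prompt, response} pairs as JSON for off-device IFEval scoring."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"prompt": p, "response": detokenize(ids)}
                   for p, ids in records], f, ensure_ascii=False, indent=2)

print(detokenize([1, 2, 3, 4]))  # Hello world!
```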
Model
We only have the toy 1B model (Freedom has already provided on-device TinyMMLU accuracy for 1B and 3B Llama 3.1, dynamically quantized with AI Edge Torch).
- [ ] We need either the 3B or the 8B model, as the group decides
LLM Benchmark for iOS (4 weeks)
Not started (do the same as on Android, and more)
Continue Sergji's effort to optimize models for CoreML (6 weeks)
- [ ] Optimize 5 models for iOS (we only have MobileNetv4 optimized so far)
- [ ] CoreML/iOS experience is needed to optimize legacy models as well as GenAI (SD, LLM)
Update TFLite/LiteRT backend (6 weeks)
- [ ] Performance degraded on Pixel 10; debug and work with Google to fix it
- For the Android benchmark, let's try to have a running app on Pixel phones with an LLM (3B or 8B).

@farook-edev For the Llama 3.2 3B and Llama 3.1 8B models, we need to know whether we can bring them into our implementation (dynamically quantized TFLite) to decide on the benchmark model.