mediapipe
Is Gemma on-device really this slow?
I used the llm_inference sample with gemma-2b-it-cpu-int4.bin on a Pixel 8 Pro emulator.
The prefill speed seems to be on the order of minutes.
Pixel 8 Pro emulator configuration: RAM 22 GB, VM heap 512 MB
Reference video: https://github.com/googlesamples/mediapipe/assets/22965002/c7730dba-48e8-4eec-ae68-fe847d2778f2
Oh boy, no, definitely not. It's not really intended to be run on the emulator, so your results are going to vary wildly. Here's a presentation I did last week with a slide showing Gemma running on a device in real time (not sped up or altered, just recorded and turned into a gif): https://docs.google.com/presentation/d/1uetAcmkNWDXHEJaCt6WoBflDM1iMUU1N1ahzQof6PLM/edit#slide=id.g26cd5c56ad9_1_30
I saw a post suggesting that an emulator with increased RAM performs similarly. Here it is - link - search for "Creating an Android Emulator with Increased RAM".
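For what it's worth, an AVD's RAM can also be raised directly by editing its `config.ini` (the `hw.ramSize` key is in megabytes). A minimal sketch; the AVD name `Pixel_8_Pro` and the default `~/.android/avd` location are assumptions, so adjust for your setup:

```shell
# Sketch: bump an AVD's RAM to 8 GB by editing its config.ini.
# The AVD name (Pixel_8_Pro) and path are assumptions for this example.
AVD_CONFIG="$HOME/.android/avd/Pixel_8_Pro.avd/config.ini"

# hw.ramSize is specified in megabytes; keep a .bak copy of the original.
sed -i.bak 's/^hw\.ramSize=.*/hw.ramSize=8192/' "$AVD_CONFIG"

# Show the updated value.
grep '^hw\.ramSize' "$AVD_CONFIG"
```

The same value can also be set from the AVD Manager UI in Android Studio under the advanced hardware settings.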
What's the difference that makes a physical device so much faster? Is it particularly customized for Gemma?
Thanks for the prompt response!
No idea at that level of detail. My general experience over the last 10+ years of Android development, though, has always been: "Eh, emulators are OK, but never as good as a real device."
Time to first token is still pretty slow compared to the video you shared: around 15 seconds for both the 4-bit and 8-bit CPU versions of Gemma 2B. The physical device I am using is a Pixel 7 Pro.
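For comparing numbers like the 15 seconds above, it helps to measure time to first token (TTFT) the same way on every device. A minimal pure-Java sketch of such a timer; the `onToken()` hook is illustrative and is not part of the MediaPipe API, so you would call it yourself from whatever streaming callback your inference code exposes:

```java
import java.util.concurrent.TimeUnit;

// Illustrative time-to-first-token (TTFT) stopwatch.
// Start it right before kicking off generation, then call onToken()
// from your streaming callback; only the first call is recorded.
public class TtftTimer {
    private final long startNanos;
    private long firstTokenNanos = -1;

    public TtftTimer() {
        this.startNanos = System.nanoTime();
    }

    // Invoke once per generated token; subsequent calls are ignored.
    public void onToken() {
        if (firstTokenNanos < 0) {
            firstTokenNanos = System.nanoTime();
        }
    }

    // TTFT in milliseconds, or -1 if no token has arrived yet.
    public long ttftMillis() {
        if (firstTokenNanos < 0) {
            return -1;
        }
        return TimeUnit.NANOSECONDS.toMillis(firstTokenNanos - startNanos);
    }
}
```

Logging this value on both the emulator and the physical device gives an apples-to-apples comparison, since it excludes everything after the first token (decode speed) and isolates model load plus prefill.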
I am using the recent Gemma 2 as well on my Android Pixel device and the performance is still too slow. Is there anything we can do to increase the performance on the Android device? Thanks.
Echoing Paul's point, our infra was not well tested on the emulator and there is no performance guarantee there. However, there is a known issue when running the Gemma 2 model on a real device that causes the speed (i.e., time to first token) to be slow. We are actively working on it, and hopefully it will be resolved by the end of this year. Please stay patient, and thanks.