Native support for pre-computed image embeddings
Hi! Thank you for this stellar work.
I was wondering if it would be possible to pass pre-computed image embeddings (image hidden states) to Gemma3's forward pass so that the vision tower is bypassed?
Is this currently possible? Is it a feature that could possibly be implemented?
Thank you!
Hi @samb271 ,
Welcome to Google's Gemma models and thanks for your interest in them. Your feedback is invaluable as we work to continuously improve the Gemma experience.
Thanks.
Hi @samb271, thanks for the question. While we're not actively working on new feature development for this project, and directly supporting pre-computed image embeddings therefore isn't planned, you might be able to achieve something similar by modifying the code. Would replacing the SigLIP vision model at line 107 with your own custom vision encoder serve the purpose?
https://github.com/google/gemma_pytorch/blob/014acb7ac4563a5f77c76d7ff98f31b568c16508/gemma/gemma3_model.py#L107
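For illustration, a minimal sketch of such a swap, assuming the vision tower is exposed as a `siglip_vision_model` attribute (as in the linked file), is called with a batch of pixel values, and returns patch embeddings of shape `(batch, num_patches, hidden_dim)`:

```python
import torch
import torch.nn as nn

class PrecomputedVisionEncoder(nn.Module):
    """Stand-in for the SigLIP tower that returns embeddings computed
    offline instead of encoding pixel values. Hypothetical sketch; the
    real tower's call signature and output shape should be checked
    against gemma3_model.py."""

    def __init__(self, embeddings: torch.Tensor):
        super().__init__()
        # One pre-computed embedding per image, shape
        # (num_images, num_patches, hidden_dim).
        self.register_buffer("embeddings", embeddings)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Ignore the pixels and return the cached embeddings for
        # however many images are in this batch.
        return self.embeddings[: pixel_values.shape[0]]

# Hypothetical usage: replace the tower on an instantiated model.
# model.siglip_vision_model = PrecomputedVisionEncoder(cached_embeddings)
```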
@MayankChaturvedi Hi! Thank you for your answer! My intention is actually to extract Gemma's vision encoder and modality projector so I can pre-compute the image embeddings of a very small dataset offline. These could then be passed to the LLM backbone instead of the pixel values at test time. I therefore don't wish to use my own vision encoder; I simply want a direct way to pass these pre-computed embeddings to the model's forward pass. Do you have a general idea of which part of the code I should modify to accomplish this? Concretely, here's a rough sketch of what I have in mind; the names (`siglip_vision_model`, `mm_projector`, `encode_images`) are just guesses based on the linked file and may differ across versions:
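```python
import torch

# `model` is assumed to be an instantiated Gemma3 multimodal model and
# `pixel_values` a preprocessed image batch for my small dataset.

# 1) Offline: run the vision tower and modality projector once and
#    cache the projected embeddings.
with torch.no_grad():
    vision_out = model.siglip_vision_model(pixel_values)
    image_embeddings = model.mm_projector(vision_out)
torch.save(image_embeddings, "image_embeddings.pt")

# 2) At test time: patch the image-encoding step so cached embeddings
#    bypass the vision tower and projector entirely.
def encode_images(self, pixel_values=None, image_embeddings=None):
    if image_embeddings is not None:
        # Already projected offline by the same encoder + projector,
        # so they can be spliced into the token sequence as-is.
        return image_embeddings
    vision_out = self.siglip_vision_model(pixel_values)
    return self.mm_projector(vision_out)
```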