[OpenVINO backend] Support inference for Gemma with the OpenVINO backend
Description of the change
As part of my GSoC 2025 project to support inference with the OpenVINO backend, this PR adds support for the Gemma pipeline.
Reference
- https://docs.openvino.ai/2025/index.html
- https://keras.io/api/
- https://keras.io/keras_hub/
Colab Notebook
Checklist
- [x] I have added all the necessary unit tests for my change.
- [x] I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
- [x] My PR is based on the latest changes of the main branch (if unsure, rebase the code).
- [x] I have followed the Keras Hub Model contribution guidelines in making these changes.
- [x] I have followed the Keras Hub API design guidelines in making these changes.
- [x] I have signed the Contributor License Agreement.
Left some initial comments! But probably the first question is around the changes to causal_lm and gemma_causal_lm. Why is this so backend-specific? It's much more involved than the changes for jax/torch/tensorflow.
Hi @mattdangerw, I've been working on enabling inference for Gemma with OpenVINO by implementing the missing operations. The main challenge I ran into is that building the entire graph as a single model makes execution difficult (it takes too long and RAM may overflow); I'm still investigating why. To address this, I introduced a subgraph-based approach that splits the full graph at key points. I also added logic to store compiled subgraphs in CausalLM so they can be reused across generations instead of being rebuilt each time.
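For illustration, here is a minimal sketch of the compiled-subgraph caching idea, assuming a hypothetical `build_fn` callable that returns an `ov.Model` for one split point; the class and names below are illustrative, not the PR's actual code:

```python
import openvino as ov


class CompiledSubgraphCache:
    """Caches compiled OpenVINO subgraphs keyed by split point."""

    def __init__(self, device="CPU"):
        self.core = ov.Core()
        self.device = device
        self._cache = {}

    def get(self, key, build_fn):
        # Compile each subgraph once, then reuse it across generation
        # steps instead of rebuilding/recompiling the full graph.
        if key not in self._cache:
            ov_model = build_fn()  # hypothetical: builds an ov.Model
            self._cache[key] = self.core.compile_model(ov_model, self.device)
        return self._cache[key]
```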
@Mohamed-Ashraf273 is there a way that we can land this without the subgraph approach?
We have a similar need in JAX at train time. Compilation times are much improved if you run a common transformer block in a compiled loop. So there is probably something to do here, but we'd really like to avoid our forward pass being a switch-case on backend. That will lead to maintenance hell.
So maybe let's try to land with the same approach as other backends for now, and see if there's a layer stacking/compilation reuse solution we can land as follow up?
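For reference, here is a minimal JAX sketch of the compiled-loop idea mentioned above, assuming identically shaped per-layer weights stacked along a leading axis; `block` is a stand-in for a transformer block, not Keras Hub code:

```python
import jax
import jax.numpy as jnp


def block(x, w):
    # Stand-in for one transformer block; a real block would do
    # attention + MLP. Here it is a single dense projection.
    return jnp.tanh(x @ w)


@jax.jit
def forward(x, stacked_weights):
    # lax.scan traces `block` once, so compile time stays roughly
    # constant regardless of the number of layers.
    def step(carry, w):
        return block(carry, w), None

    out, _ = jax.lax.scan(step, x, stacked_weights)
    return out


# Usage: 24 layers of (hidden, hidden) weights stacked along axis 0.
hidden = 64
stacked = jnp.full((24, hidden, hidden), 0.01)
y = forward(jnp.ones((2, hidden)), stacked)
```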
Hi @mattdangerw, I removed the subgraph approach and the compiled-subgraph reuse. The model can now run inference with OpenVINO and passes all tests; we just need to think about how to optimize inference with large parameter counts and real weights without RAM overflow. I'd appreciate it if you could take another look. Thanks!
@mattdangerw My PR is ready for review!
@fchollet Can you take a look?
@mattdangerw
Hi @fchollet, I'd appreciate any feedback on my PR. Thanks!
/gemini review
@fchollet @mattdangerw @rkazants @divyashreepathihalli
@mattdangerw @divyashreepathihalli