Utilize Gemma
What would you like to be added?
Hi there
I've been looking forward to using Google's CLI for a while (ever since seeing the competing technologies, Codex and Claude). It's really great to see the potential realized in this initial effort. Thank you for reading this feature request:
It would be great if there was an option to use Gemma with google-cli. There are many benefits this could provide to users, teams and Google itself. In case these are not immediately obvious, here's a list:
Users:
- Users would always get some kind of response to CLI requests, even if not the high-quality cloud-delivered responses from Gemini. That supports all-day interaction (which seems to be a key objective of this tool).
- Users would take a lot longer to use up the daily allotment of interaction with google-cli, and could then gracefully fall back to local-only interaction.
- As Gemma evolves, the baseline of interaction quality would gradually improve.
Teams:
- Using local models opens the door to using a model other than Gemma as the baseline.
- Teams could potentially create their own baseline model, trained for a specific purpose and context.
- In such a world, it wouldn't be impossible to leverage Google Drive for storage and delivery of models that stay within the organization and don't leak into the public domain.
Google:
- You must be eating a HUGE cost from all of this interaction. Local models would help absorb a portion of it.
- I understand that you're building in logic to determine which model should serve a given request. You could expand this logic to include local models - you shouldn't have to pay for responding to "Good Morning". A rough sketch of that routing idea follows below.
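To make the routing idea concrete, here is a minimal sketch. It is purely illustrative and not gemini-cli code: callLocalGemma and callGeminiApi are hypothetical placeholders for an on-device Gemma client and the hosted Gemini client, and the word-count heuristic is just an example.

// Hypothetical routing sketch -- not gemini-cli code.
type ModelResponse = { text: string; servedBy: 'local' | 'cloud' };

async function routePrompt(
  prompt: string,
  callLocalGemma: (p: string) => Promise<string>,
  callGeminiApi: (p: string) => Promise<string>,
): Promise<ModelResponse> {
  // Naive heuristic: short, conversational prompts ("Good Morning") are
  // answered locally; anything substantial, or any local failure, goes
  // to the cloud model.
  const isTrivial = prompt.trim().split(/\s+/).length < 8;
  if (isTrivial) {
    try {
      return { text: await callLocalGemma(prompt), servedBy: 'local' };
    } catch {
      // Local model unavailable or errored -- fall through to the cloud.
    }
  }
  return { text: await callGeminiApi(prompt), servedBy: 'cloud' };
}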
And this is just the beginning of this idea. These are the impacts that are obvious to me, a developer who isn't privy to your grand plans and ideas. I believe you are in a good position to leverage the decent quality of the Gemma local model to take advantage of this strategy. Although we should call out the negatives of this feature:
Negative aspects to mitigate:
- Size: Although Gemma comes in many sizes, the largest versions would likely be too large for 99% of users and the smallest might still be too large for a significant portion of users (when considering CPU / GPU, RAM, and other hardware complications)
- Feature creep: This would be a SIGNIFICANT undertaking, likely too big an effort to creep in at this stage. Might be a good google-cli:2.0.0 feature.
What do you think?
Why is this needed?
If all day google-cli interaction is a key objective, why not guarantee the completion of that user story by utilizing local models?
Additional context
No response
+1 This would be a VERY useful extension of the capabilities of Gemini CLI on-device and off-line
This is an amazing idea, and I have a working prototype for this application that I was working on this week. You can check it out here: https://github.com/sairambokka/ollama-coding-agent
I did fork gemini-cli, and I will probably make a version where we can use Gemma. Gemini CLI is a good starting point for this to happen.
For reference, there are already multiple open source CLI coding assistants which could utilize local LLMs like Gemma, e.g. Aider, MyCoder, Goose, Devseeker, Codai, oli, Open Codex, OCode and more. But if Gemini CLI provided support for local LLMs, it would be a big competitor to them.
Introducing Gemma 3n: The developer guide - Google Developers Blog https://share.google/8xr7dGnwnGYwDzW4l
Gemma 3n might be a good place to start
Gemini-CLI will load Gemma-3 and Gemma-3n models from Google, but they fail.
Suggests to me that they're not gonna work for function calling via API at all?
Oh interesting. Perhaps that issue is on the Gemma side of things, i.e. not implementing the proper API responses.
@GhostArchitect01 can this issue be solved by teaching gemma how to properly handle prompt responses?
Can you modify its responses?
Related: https://github.com/BerriAI/litellm/issues/10313
That issue points to this solution: https://github.com/IllFil/gemma3-ollama-tools
I don't see how that solution is helpful - we're not using local models here.
Reviewing Gemini-CLI a little, I think modifying config.ts and models.ts is worth trying - models seem to be defined there. A hypothetical sketch of the kind of change is below.
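To be clear, the following is purely hypothetical and not taken from the actual gemini-cli source; it only illustrates the sort of addition being suggested.

// Hypothetical sketch -- not the actual contents of models.ts or config.ts.
// It only illustrates registering a local Gemma model alongside the hosted
// Gemini ones.
export const LOCAL_GEMMA_MODEL = 'gemma-3-27b-it'; // hypothetical identifier

// A local model needs an endpoint to talk to instead of the Gemini API,
// for example an OpenAI-compatible server such as Ollama or LM Studio.
export interface LocalModelConfig {
  model: string;   // e.g. LOCAL_GEMMA_MODEL
  baseUrl: string; // e.g. 'http://localhost:11434/v1'
}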
Relevant: https://www.philschmid.de/gemma-function-calling
I know Gemma 3 has function calling. The question is 'is it possible to use Gemma 3/3n in Gemini CLI' and Gemini CLI says function calling is not enabled.
Function Calling via API seems to be possible - so the question is whether this is server side and specific to Gemini-CLI or if it's a configuration issue (config.ts and models.ts seem to be good places to look).
I guess another option is to set up an MCP that allows Gemma 3/3n to use the Gemini-CLI tools?
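Coming back to the function-calling-via-API question, here is roughly what prompt-level tool calling against a locally served Gemma could look like. Everything in it is an assumption on my part: the OpenAI-compatible endpoint (e.g. Ollama or LM Studio), the model tag, and the "answer with JSON" convention. It is not Gemini CLI or Gemma API surface, just an illustration.

// Sketch: describe a tool in the system prompt and try to parse a JSON
// tool call out of the reply.
const BASE_URL = 'http://localhost:11434/v1'; // assumed local server
const MODEL = 'gemma3';                       // assumed model tag

const systemPrompt = [
  'You have access to one tool:',
  'read_file(path: string) -> returns the file contents.',
  'If you need the tool, answer ONLY with JSON like',
  '{"name": "read_file", "arguments": {"path": "..."}}.',
].join('\n');

async function askGemma(userPrompt: string): Promise<unknown> {
  const res = await fetch(`${BASE_URL}/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: MODEL,
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: userPrompt },
      ],
    }),
  });
  const data = await res.json();
  const text: string = data.choices[0].message.content;
  try {
    return JSON.parse(text); // looks like a tool call
  } catch {
    return text; // plain text answer
  }
}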
I have Gemma running locally through LM Studio using this fork: https://github.com/acoliver/gemini-cli/tree/multi-provider. I added Gemma's specific tool format parsing (also Claude and OpenAI, and I'm working on DeepSeek and Qwen).
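In case it helps others, the kind of parsing involved is roughly the following. The fenced tool_call convention is my assumption about how the model gets prompted, not the fork's actual implementation.

// Sketch of pulling tool calls out of a Gemma completion, assuming the
// model was prompted to emit blocks like:
// ```tool_call
// {"name": "read_file", "arguments": {"path": "src/index.ts"}}
// ```
interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

function parseGemmaToolCalls(completion: string): ToolCall[] {
  const calls: ToolCall[] = [];
  const fence = /```tool_call\s*([\s\S]*?)```/g;
  let match: RegExpExecArray | null;
  while ((match = fence.exec(completion)) !== null) {
    try {
      calls.push(JSON.parse(match[1]) as ToolCall);
    } catch {
      // Skip blocks that aren't valid JSON.
    }
  }
  return calls;
}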
Is there any reason it's only local? Does the Google API not let you call Gemma-3 models even after the tool formatting?
It will probably work if you put in your API key. The "auth" method only allows gemini pro and flash. It isn't a Gemini-cli limitation, it is an intentional backend limitation. I'll test that later today.
Doing some semi-related work via #2318
you can now use the fork:
Install:
npm install -g @vybestack/llxprt-code
npx @vybestack/llxprt-code
docker run -it ghcr.io/acoliver/llxprt-code/sandbox:0.1.12
I've been following, thank you!
I managed to make it work to a very limited extent with LiteLLM. I'm running Gemma 3 using "docker model" and the version available on their hub. Unfortunately it only works well if your prompt is short. It fails as soon as you have a moderately sized prompt and ask it to do things in your codebase.
This is the error I get: ✕ [API Error: Model stream ended with an invalid chunk or missing finish reason.]
Could be related to this in my case --> https://github.com/google-gemini/gemini-cli/issues/7851
Env variables:
set GEMINI_API_KEY=dummy-key
set GOOGLE_GEMINI_BASE_URL=http://localhost:4000
config.yaml of LiteLLM:
model_list:
  - model_name: ai/gemma3
    litellm_params:
      model: openai/ai/gemma3
      api_base: http://localhost:12434/engines/v1
      api_key: "dummy-key"

router_settings:
  model_group_alias: {"gemini-2.5-pro": "ai/gemma3"}
Docker:
docker desktop enable model-runner --tcp=12434
docker model run ai/gemma3
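If it helps anyone debugging the same error: before pointing Gemini CLI at the proxy, a small direct request can confirm that the proxy and the alias work at all. This is a sketch, assuming the setup above and that the proxy exposes the usual OpenAI-compatible chat completions route.

// Sanity check against the LiteLLM proxy from the config above
// (http://localhost:4000, dummy key, alias gemini-2.5-pro -> ai/gemma3).
async function pingProxy(): Promise<void> {
  const res = await fetch('http://localhost:4000/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: 'Bearer dummy-key',
    },
    body: JSON.stringify({
      model: 'gemini-2.5-pro', // resolved to ai/gemma3 by model_group_alias
      messages: [{ role: 'user', content: 'Say hello in one sentence.' }],
    }),
  });
  console.log(res.status, await res.json());
}

pingProxy();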
I see that last week the "enhancement" tag was removed from this issue. I wonder what that communicates? Are you not planning to implement this feature?
- Do you not want a user to be able to include Google-specific local models in a collection of available models that could handle a user's input?
- Do you disagree that the cost savings of running models locally outweigh the prospective value of studying user input as it comes over the internet to Google's servers?
- Would you rather simplify the scope of problems you'll have to solve in the future by not taking on the whole problem set of local models handling requests?
Help me understand the current state of thinking about this feature addition.
@cosmicdreams @kdrapelinexto FYI we added local-model support to LLxprt Code (a gemini-cli fork): https://github.com/vybestack/llxprt-code/blob/main/docs/local-models.md - the port settings there may resolve the issue you saw. See also https://github.com/vybestack/llxprt-code/blob/main/docs/prompt-configuration.md - models like Gemma running locally can't handle a huge system prompt, so you can shrink it.
You can also use, say, Gemini Pro (or your Claude Max account) with Gemma subagents in the upcoming release.
I would suggest using the users' (or a single user's) local compute capacity to offload the entire system.