
Feat: Add ability to specify "num_ctx" for Ollama provider

[Open] TheSpaceGod opened this issue 2 months ago • 12 comments

Why is Per-Request Control of num_ctx Important on Consumer GPUs?

On consumer-grade GPUs, VRAM is a scarce and critical resource. The context window size (num_ctx) is one of the largest consumers of VRAM, as the memory required for the KV cache grows linearly with the number of tokens in the context.

  • VRAM Management is Key: Consumer GPUs (like the NVIDIA 30 and 40 series) have limited VRAM (e.g., 8GB, 12GB, 24GB). A large, fixed context window can consume all available VRAM, preventing larger or more capable models from even loading. It also leaves no room for processing longer documents or complex conversations.
  • Task-Specific Needs: Different tasks require different context sizes. A simple question might only need a 2048-token context, while summarizing a large document or analyzing a codebase could require 32,000 tokens or more. Forcing a one-size-fits-all context is inefficient. A user should be able to use a small context for simple tasks to conserve VRAM and a large one for complex tasks, without having to reload the model.
  • Performance Impact: When the KV cache and model weights exceed available VRAM, the system has to offload layers to much slower system RAM. This causes a dramatic drop in performance (token generation speed), making the user experience sluggish and often unusable. Dynamic control over num_ctx allows the user to balance context size with performance based on their specific hardware limitations.
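
To put rough numbers on the KV-cache point above, here is a back-of-the-envelope sketch. The dimensions are illustrative assumptions (roughly an 8B-class model with grouped-query attention at fp16), not measurements for any specific model:

// KV cache size ≈ 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * num_ctx.
// Assumed dimensions (illustrative, Llama-3.1-8B-like): 32 layers, 8 KV heads,
// head dim 128, fp16 (2 bytes). Real numbers vary with the model and KV-cache quantization.
const layers = 32, kvHeads = 8, headDim = 128, bytesPerElem = 2;

function kvCacheGiB(numCtx: number): number {
  return (2 * layers * kvHeads * headDim * bytesPerElem * numCtx) / 1024 ** 3;
}

console.log(kvCacheGiB(2048).toFixed(2));  // ≈ 0.25 GiB
console.log(kvCacheGiB(32768).toFixed(2)); // ≈ 4 GiB, on top of the model weights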

Why a Client-Side Setting is Better Than a Server-Only Config

While num_ctx can be set globally on the Ollama server, this approach is inflexible and inefficient for a tool like a VS Code extension.

  • Flexibility for Different Projects: A developer might be working on multiple projects with different needs. One project might involve a large legacy codebase requiring a massive context, while another is a small script. A client-side setting allows the developer to adjust the context size on a per-project or even per-task basis without reconfiguring and restarting the central Ollama server.
  • Prevents Model Reloading: If different clients (like opencode, Open WebUI, etc.) request the same model but with different num_ctx values, the Ollama server is forced to unload and reload the model for each change. This introduces significant delays and creates a poor user experience. A client-side setting allows for sending the desired num_ctx with each request, which is a more efficient and standard way of interacting with the Ollama API.
  • Multi-User Environments: In a scenario where a single Ollama instance serves multiple users or applications, a server-wide setting is impractical. Each user and application will have different requirements, and a client-side parameter allows them to coexist without interfering with each other.

How Other Projects Have Implemented This

Other popular clients for Ollama have already recognized the importance of this feature and provide straightforward ways to configure it:

  • Open WebUI: In the chat interface, there is a settings panel for each conversation where the "Context Length" (num_ctx) can be explicitly set. This parameter is then passed along with the API request to Ollama for that specific chat session.
  • Cline: The documentation for Cline explicitly recommends a larger context window for coding tasks (at least 32k tokens). In its settings, under "API Configuration," there is a "Context Window" field where the user can specify the desired num_ctx.

By implementing a similar client-side configuration, opencode would align with best practices in the ecosystem and provide its users with the necessary control to effectively use local LLMs on consumer hardware.

TheSpaceGod (Oct 17 '25 23:10)

This issue might be a duplicate of existing issues. Please check:

  • #1068: Tool use with Ollama models - This issue shows a working configuration that already includes "options": { "num_ctx": 65536 } and "options": { "num_ctx": 131072 } for Ollama models, suggesting this feature may already be implemented.

Feel free to ignore if none of these address your specific case.

github-actions[bot] (Oct 17 '25 23:10)

is it best to let the user configure or should we set 32k as the default and then let them change it?

rekram1-node (Oct 18 '25 00:10)

That's a tough one. I would lean towards just letting the user configure it, since it's so problem- and hardware-dependent. The Ollama num_ctx default will probably not be good enough for opencode, period, but that's the same for many other tools using Ollama as the backend LLM provider. People using Ollama need to be educated enough on the situation to improve their overall system performance and to know the pros/cons of modifying num_ctx.

The only thing that might be nice to add is a little warning that says the input is likely getting truncated by Ollama, given its default num_ctx of 4k tokens, if it's not overridden on the opencode side of things.

TheSpaceGod (Oct 18 '25 21:10)

Side note: tools like Cline have tried to reformat their system prompt in "low context window" environments by strategically removing some features, to ensure the core functionality still fits within the available context window. Might be something to consider for opencode paired with Ollama too.

TheSpaceGod (Oct 18 '25 21:10)

Interesting. No one on our team uses Ollama, so this may be something best done by someone more familiar with it.

The system prompt part makes sense; we already do some customization per provider, so adding something specifically for Ollama would be a good idea: https://github.com/sst/opencode/blob/c8898463a7f67ab07ee034422dc10b0a2ba5defe/packages/opencode/src/session/system.ts#L25
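
Something along these lines, purely as a hypothetical sketch; the names and function shape are placeholders and do not reflect the actual exports in system.ts:

// Hypothetical sketch: "systemPrompt", "PROMPT_DEFAULT", and "PROMPT_OLLAMA_COMPACT"
// are placeholder names, not the real code in packages/opencode/src/session/system.ts.
const PROMPT_DEFAULT = "...full system prompt, reportedly around ~2k tokens...";
const PROMPT_OLLAMA_COMPACT = "...trimmed prompt keeping only the core tool instructions...";

// Pick a system prompt per provider, mirroring the existing per-provider customization.
export function systemPrompt(providerID: string): string[] {
  // For local Ollama models, assume a small context window and prefer the compact prompt,
  // leaving more of num_ctx available for the user's files and conversation.
  if (providerID === "ollama") return [PROMPT_OLLAMA_COMPACT];
  return [PROMPT_DEFAULT];
}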

I will look up how to set num_ctx via api call and then update our docs accordingly

As for setting it in the TUI itself, that will probably be best to address later on, since we are in the midst of a rewrite and I'm not sure where that would fit in currently.

If you or anyone else would like to make a PR for the Ollama system prompt guide, you are welcome to. I think the current system prompt is ~2k tokens.

rekram1-node (Oct 19 '25 05:10)

I will look up how to set num_ctx via api call and then update our docs accordingly

According to Ollama's docs, in the API call you can add the option:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": {
    "num_ctx": 4096
  }
}'
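
In TypeScript that request would look roughly like this (a minimal sketch, assuming Node 18+ with global fetch and top-level await; it hits the native /api/generate endpoint, not the OpenAI-compatible one):

// Per-request num_ctx via the native Ollama API.
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.2",
    prompt: "Why is the sky blue?",
    stream: false,              // return a single JSON object instead of a stream
    options: { num_ctx: 4096 }, // context window for this request only
  }),
});
const data = await res.json();
console.log(data.response);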

Pablo1107 (Nov 04 '25 21:11)

I don't think it works via the OpenAI compat endpoint though, right @Pablo1107?

rekram1-node (Nov 04 '25 21:11)

I don't think it works via the OpenAI compat endpoint though, right @Pablo1107?

Oh, you're right, the docs I sent were the Ollama API spec. My bad.

Edit: the docs explicitly state that the OpenAI compat API cannot be used to set num_ctx, sadly. :(

Maybe we can upstream this issue to Ollama to ask for that feature.

Pablo1107 (Nov 04 '25 23:11)

Apparently there was an effort to make that possible here, but it was abandoned.

And there's also this issue, which I think was wrongly closed as completed, while the solution has nothing to do with OpenAI endpoints.

Pablo1107 (Nov 04 '25 23:11)

Does it have to be set in every request, or can you set it once and have it stick? We could do one-offs, but setting it on every request would be harder.

I think there is an Ollama AI SDK provider, so maybe that could work.
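
If someone tries that route, a rough sketch might look like the following. It assumes the community ollama-ai-provider package accepts a numCtx model setting and forwards it as options.num_ctx on the native endpoint; that assumption should be verified against the package docs:

// Sketch only: assumes ollama-ai-provider exposes a numCtx setting (verify before relying on it).
import { generateText } from "ai";
import { ollama } from "ollama-ai-provider";

const { text } = await generateText({
  // Hypothetical per-model setting, forwarded to Ollama's native API as options.num_ctx,
  // so the context window can be set per request without a server-wide config.
  model: ollama("llama3.2", { numCtx: 32768 }),
  prompt: "Summarize this repository's architecture.",
});
console.log(text);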

rekram1-node (Nov 05 '25 00:11)

Does it have to be set in every request, or can you set it once and have it stick? We could do one-offs, but setting it on every request would be harder.

I think there is an Ollama AI SDK provider, so maybe that could work.

It's hacky, but I would argue that if you set it on the first request that loads the model, the parameter should stay in effect until the model is unloaded. I would not advise going this route, though, since the model is by default unloaded after a few minutes.

Pablo1107 (Nov 05 '25 00:11)

yeah that sounds too hacky

rekram1-node (Nov 05 '25 01:11)