Codex with gpt-oss:120b model appears to forget the task
What version of Codex is running?
codex-cli 0.21.0
Which model were you using?
gpt-oss:120b
What platform is your computer?
Darwin 24.6.0 arm64 arm
What steps can reproduce the bug?
Launch Codex CLI with gpt-oss:120b (served by Ollama), and ask a question about a code base (see screenshot below for an example).
What is the expected behavior?
My question answered.
What do you see instead?
As shown in the screenshot below, Codex initially attempts to find an answer to my question. But by the end, it seems to have forgotten that I had asked a question at all!
The last two turns show:
thinking  We have a huge repo with many imports. The user hasn't asked a question yet. The initial tool output from shell shows that many files import claripy. Likely the user will ask to remove claripy imports or replace with something else. But we need to see the prompt: The user hasn't given a request. Maybe next message will contain a task. We need to wait for user input.

codex  I'm ready for your next instruction!
Additional information
No response
Theory: Could it be that the context Ollama allocates for Codex CLI is too short?
Why do I think that?
If I run gpt-oss:20b, long documents also derail codex (Ubuntu Linux machine).
"ollama ps" yields the following information regarding the model (loaded by codex upon start):
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gpt-oss:20b f2b8351c629c 16 GB 100% GPU 8192 4 minutes from now
So, it seems the context is only 8192, which is too small. Is there a way to enforce a larger context?
You're absolutely right! The Ollama FAQ states:
The gpt-oss model has a default context window size of 8192 tokens.
And it gives a few ways to increase the context window size.
Perhaps Codex CLI should issue a warning if Ollama's context window size is too small? The default behavior (with an OSS model loaded by Codex) can confuse users.
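For illustration only, here is a rough sketch (not actual Codex code) of what such a startup check could look like: shell out to ollama ps, read the CONTEXT column, and warn when it is smaller than the context window Codex assumes. The column layout is taken from the output shown above, and parsing CLI text like this is brittle, so treat it purely as a sketch of the idea.

```rust
use std::process::Command;

/// Best-effort read of the context length Ollama actually loaded for `model`,
/// by parsing the `ollama ps` table shown earlier in this thread.
fn ollama_loaded_context(model: &str) -> Option<u64> {
    let out = Command::new("ollama").arg("ps").output().ok()?;
    let stdout = String::from_utf8_lossy(&out.stdout);
    // Skip the header row and find the row for our model.
    let line = stdout.lines().skip(1).find(|l| l.starts_with(model))?;
    // With the layout "... CONTEXT  4 minutes from now", CONTEXT is the fifth
    // whitespace-separated field from the end (an assumption; brittle).
    let fields: Vec<&str> = line.split_whitespace().collect();
    let idx = fields.len().checked_sub(5)?;
    fields[idx].parse().ok()
}

/// Hypothetical warning Codex CLI could print on startup.
fn warn_if_context_too_small(model: &str, expected: u64) {
    if let Some(loaded) = ollama_loaded_context(model) {
        if loaded < expected {
            eprintln!(
                "warning: Ollama loaded {model} with a {loaded}-token context, \
                 but Codex assumes {expected}; long sessions may be truncated"
            );
        }
    }
}

fn main() {
    // 96_000 matches the context_window Codex configures for gpt-oss (see the snippet below).
    warn_if_context_too_small("gpt-oss:20b", 96_000);
}
```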
I am still scratching my head.
https://github.com/openai/codex/blob/main/codex-rs/core/src/openai_model_info.rs clearly states that 128k should be used for context:
pub(crate) fn get_model_info(model_family: &ModelFamily) -> Option<ModelInfo> {
    let slug = model_family.slug.as_str();
    match slug {
        // OSS models have a 128k shared token pool.
        // Arbitrarily splitting it: 3/4 input context, 1/4 output.
        // https://openai.com/index/gpt-oss-model-card/
        "gpt-oss-20b" => Some(ModelInfo {
            context_window: 96_000,
            max_output_tokens: 32_000,
        }),
        "gpt-oss-120b" => Some(ModelInfo {
            context_window: 96_000,
            max_output_tokens: 32_000,
        }),
I briefly looked into the uses of the context_window config parameter. It appears this number is used only in TUI components:
https://github.com/openai/codex/blob/5552688621340c7042a9240a9bedc864242c5ee7/codex-rs/tui/src/chatwidget.rs#L191-L199
https://github.com/openai/codex/blob/5552688621340c7042a9240a9bedc864242c5ee7/codex-rs/tui/src/chatwidget.rs#L733-L740
https://github.com/openai/codex/blob/5552688621340c7042a9240a9bedc864242c5ee7/codex-rs/tui/src/bottom_pane/mod.rs#L223-L232
https://github.com/openai/codex/blob/5552688621340c7042a9240a9bedc864242c5ee7/codex-rs/tui/src/bottom_pane/chat_composer.rs#L129-L140
https://github.com/openai/codex/blob/5552688621340c7042a9240a9bedc864242c5ee7/codex-rs/tui/src/bottom_pane/chat_composer.rs#L672-L687
So it appears the configured context_window is not passed to Ollama, which serves the OSS model.
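For context, Ollama's native API (as opposed to the OpenAI-compatible /v1 endpoint that Codex talks to) does accept a per-request num_ctx option, which is the value that would need to be forwarded. A minimal sketch, assuming reqwest and serde_json as dependencies and a local Ollama on the default port:

```rust
// Assumed Cargo.toml deps:
//   reqwest = { version = "0.12", features = ["blocking", "json"] }
//   serde_json = "1"
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Ollama's native /api/chat accepts per-request options, including num_ctx.
    // The OpenAI-compatible /v1 endpoint has no equivalent field as far as I can
    // tell, which is why the Modelfile / saved-model workarounds below are needed
    // when the client (Codex) only speaks the /v1 API.
    let body = json!({
        "model": "gpt-oss:20b",
        "messages": [{ "role": "user", "content": "What does this repo do?" }],
        "stream": false,
        "options": { "num_ctx": 131072 }
    });

    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:11434/api/chat")
        .json(&body)
        .send()?
        .json()?;

    println!("{}", resp["message"]["content"]);
    Ok(())
}
```

In other words, even though codex-rs knows context_window = 96_000, that number never reaches Ollama through the OpenAI-compatible request, so Ollama falls back to its 8192-token default.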
You are right. In the meantime, I tried to correct this behavior in my local fork of codex-rs.
With a lot of tinkering I got "ollama ps" to sometimes show the desired context length as set (I tried both larger and shorter contexts). However, codex-rs is surprisingly persistent: when I use the TUI, somehow it always stops the existing model and replaces it with another one, once more with the default context length of 8k.
Likely that is a quick fix for someone seasoned in ollama use, but I am out of my league. :-(
I poked at this for a while with my gpt-oss-20b model. I used ollama to give it a 32k context, but that still didn't help. It's unclear what makes it "forget": chatgpt theorized it was a streaming bug in ollama, but it also "forgets" when I use exec commands.
Curious. When you run "ollama ps" while codex is operating, is the model always loaded with 32k, or does it sometimes 'switch back' to 8k?
Here's what I ended up doing:
- On the Ollama side, to set a larger context, I created a Modelfile like this:

  FROM gpt-oss:120b
  PARAMETER num_ctx 131072

  I then created a new model in Ollama like this:

  ollama create gpt-oss:120b-128k -f ./Modelfile

- On the Codex CLI side, I added the following to ~/.codex/config.toml:

  [model_providers.ollama]
  name = "Ollama"
  base_url = "http://ollama-server-address:11434/v1"

  [profiles.gpt-oss-120b-128k]
  model_provider = "ollama"
  model = "gpt-oss:120b-128k"

  I then launch Codex CLI using codex exec --profile gpt-oss-120b-128k.
This seems to be working for me!
I did something similar with gpt-oss:20b, creating a 32k model, and my request still gets forgotten, as it does with the vanilla codex --oss with ollama serve running. This is on an M3 Pro with 36 GB of memory.
When I use the LM Studio server instead of ollama, and load gpt-oss-20b there with expanded context, initial indications from codex cli look good; I've had some complete exchanges with changes and explanatory responses. So it's possible this "forgetting" is an ollama bug, or at least a codex/ollama integration issue that could be improved.
Just for other readers of this thread:
I had success running llama.cpp with the full 128k context for both the 20B and the 120B GPT-OSS models (quad 3090 system here). This documentation was very helpful and it works nicely with codex: https://github.com/ggml-org/llama.cpp/discussions/15396
I could not get ollama to work; inference became very slow for >40k context and it moved everything to the CPU.
PS: for the sake of completeness, these commands work on my quad 3090 system:
./llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -ngl 99 -fa
./llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -ngl 99 -fa
Both(!) give slightly above 60 tokens per second and the full context. For large contexts the tps count drops significantly, I think to about 15 tps for very long contexts.
@zhangwen0411 Thanks for providing the workaround.
There is also a simpler way to do it.
Run ollama show:
ollama show gpt-oss:20b
Model
architecture gptoss
parameters 20.9B
context length 131072
embedding length 2880
quantization MXFP4
Capabilities
completion
tools
thinking
Parameters
temperature 1
License
Apache License
Version 2.0, January 2004
...
Note: the context length shown there tells you the model's upper limit. I found out for myself that running LLMs at full context on my MacBook Pro with an M3 Pro and 36 GB of RAM makes them ultra-slow and starves my Mac of RAM, given I am also running JetBrains IDEs:
ollama ps
NAME                ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:128k-20b    2d8df491533b    34 GB    16%/84% CPU/GPU    131072     4 minutes from now
Then run ollama run gpt-oss:20b from your terminal:
ollama run gpt-oss:20b
>>> /?
Available Commands:
/set Set session variables
/show Show model information
/load <model> Load a session or model
/save <model> Save your current session
/clear Clear session context
/bye Exit
/?, /help Help for a command
/? shortcuts Help for keyboard shortcuts
Use """ to begin a multi-line message.
>>> /set parameter
Available Parameters:
/set parameter seed <int> Random number seed
/set parameter num_predict <int> Max number of tokens to predict
/set parameter top_k <int> Pick from top k num of tokens
/set parameter top_p <float> Pick token based on sum of probabilities
/set parameter min_p <float> Pick token based on top token probability * min_p
/set parameter num_ctx <int> Set the context size
/set parameter temperature <float> Set creativity level
/set parameter repeat_penalty <float> How strongly to penalize repetitions
/set parameter repeat_last_n <int> Set how far back to look for repetitions
/set parameter num_gpu <int> The number of layers to send to the GPU
/set parameter stop <string> <string> ... Set the stop parameters
>>> /set parameter num_ctx 31072
Set parameter 'num_ctx' to '31072'
>>> /save gpt-oss:32k-20b
Created new model 'gpt-oss:32k-20b'
When running the 32k model my memory usage is acceptable:
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gpt-oss:32k-20b e83f14b6f6fb 18 GB 100% GPU 31072 4 minutes from now
After running the command that sets the context length, I save it as a separate model. It then runs with this context length regardless of environment. It works for me like a charm and is simpler than creating a separate Modelfile.
Hi, has anyone got it to work successfully with vLLM?
@jayteaftw I think with vLLM the issue might be that not all endpoints support streaming: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#harmony-format-support I've tried both the Responses API and the Chat API, but alas I can't get it to work with function calling. I can also see that the request came through; it's just that nothing was generated:
(APIServer pid=1) INFO 09-17 13:03:06 [api_server.py:1545] response_body={streaming_complete: no_content, chunks=47}
(APIServer pid=1) INFO 09-17 13:03:12 [loggers.py:123] Engine 000: Avg prompt throughput: 767.7 tokens/s, Avg generation throughput: 5.7 tokens/s, Running: 0 reqs, Waiti
(APIServer pid=1) INFO 09-17 13:03:22 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting
Here is a relevant thread on Reddit: https://www.reddit.com/r/LocalLLaMA/comments/1n0h6l7/gptoss_tools_calling_is_not_working_with_vllm/ It seems people had luck with the Responses API (presumably without streaming).
> Hi, has anyone got it to work successfully with vLLM?
@jayteaftw, not me. I have been trying to use the vLLM container with the gpt-oss-20b model, and tool calling is just not working at all. You can run prompts like "What are you?" and it will call the model and respond. But as soon as you ask it to run a command, codex just does nothing and returns no output. Things work fine with updated llama.cpp serving gpt-oss models with codex cli. The problem seems to be with vLLM itself, not codex:
- https://github.com/openai/codex/issues/2565#issuecomment-3240190654
Codex and vllm are not shaking hands correctly somehow.
Since this also seems to be a problem with the gpt-oss-20b (gpt-oss:20b) model, can we update the title of this issue from:
Codex with gpt-oss:120b model appears to forget the task
to
Codex with gpt-oss models appear to forget the task
?
From https://github.com/openai/codex/issues/2257#issuecomment-3319825535
> I think it is vLLM. vLLM's guide for deploying GPT-OSS shows that the Response API with Streaming does not support function calling. So if Codex is calling that API path, then I think that would be the problem.
Is there a way to configure codex to use the batch "Response API" mode so we can use vLLM-served gpt-oss-120b model? The "Response API" seems to be supported with vLLM as per:
- https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#harmony-format-support
How did OpenAI not make sure that vLLM was updated to serve gpt-oss-120b model in a way that Codex CLI could use it? Yes, you can run the gpt-oss-20b model with llama.cpp very well (even excellently), but to seriously serve the gpt-oss-120b model you need a serious GPU like an H100 where llama.cpp does not seem to work (see https://github.com/ggml-org/llama.cpp/issues/15112), so you really have to be serving this model with vLLM. We have this great model and we can't use it for anything real. How did this happen?
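I don't know of a Codex config switch for that, but to at least confirm whether a given vLLM deployment handles function calling on the non-streaming Responses API, you can hit the endpoint directly. A rough sketch, with the caveats that the field names follow the OpenAI Responses API and may differ, the shell tool is just a hypothetical probe, and reqwest/serde_json are assumed dependencies:

```rust
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = json!({
        "model": "openai/gpt-oss-120b",   // model name as served by your vLLM instance
        "input": "List the files in the current directory.",
        "stream": false,
        // Flattened function-tool schema used by the Responses API (assumption).
        "tools": [{
            "type": "function",
            "name": "shell",
            "description": "Run a shell command",
            "parameters": {
                "type": "object",
                "properties": { "command": { "type": "string" } },
                "required": ["command"]
            }
        }]
    });

    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://vllm-server:8000/v1/responses")   // adjust host/port
        .json(&body)
        .send()?
        .json()?;

    // Look for items of type "function_call" in resp["output"].
    println!("{}", serde_json::to_string_pretty(&resp)?);
    Ok(())
}
```

If a function call comes back here but the streaming variant produces nothing (like the streaming_complete: no_content log above), that points at the streaming path Codex uses rather than at the model itself.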
@bartlettroscoe I haven't found one; then again, I haven't had much time to experiment.
Maybe it could work with sglang, but then again I haven't tested that yet...
Also, llama.cpp is limited for concurrent requests (AFAIK it supports them, but you need to split the context size by the number of parallel requests), so vllm would be better for a production deployment (if it worked, that is).
> Also, llama.cpp is limited for concurrent requests (AFAIK it supports them, but you need to split the context size by the number of parallel requests), so vllm would be better for a production deployment (if it worked, that is).
FYI: I am looking at flexllama as a possible option to serve gpt-oss-120b on H100s to a handful of customers (assuming we can get llama.cpp to serve the gpt-oss-120b model on our H100s).
+1, I have the same behaviour with gpt-oss + the latest release of vllm + litellm on 2xH100. I have 128k context.
same behavior with gpt-oss-120b on sglang (hosted on our 2x rtx 6000 ada server) with 64k context window
+1
FYI: We have a successful demo using llama-swap to serve the gpt-oss-120b model on multiple GPUs (H100s) from a single endpoint IP:port. That is good enough for our research and benchmarking use cases. But this will not work for our customers who are using vLLM.
Anyone working on this?
Also observing this issue in codex cli when trying to access gpt-oss-120b running in vllm (on a separate nvidia-dgx-spark device). Other client tools like cline and cline-cli have no trouble accessing this same model hosted by vllm and using tools etc., so it is not the vLLM configuration. Cline is a little slanted toward Anthropic conventions, though, so I was hoping for better tool use from OpenAI's own tool talking to their own open-source model, but so far no luck! The most I can get out of codex with the vllm model is a short answer without context (like "short poem on python"), then it forgets everything....
> Other client tools like cline and cline-cli have no trouble accessing this same model hosted by vllm and using tools etc.
@cboettig, those agents may be using the batch non-streaming "Response API" for vLLM gpt-oss, which is documented to work at:
- https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#harmony-format-support
The issue is that vLLM did not support the "Response API with Streaming" which is what codex cli seems to be using.
However, with the vLLM PR merged just a couple of days ago:
- https://github.com/vllm-project/vllm/pull/26874
that might be resolved (but I have not verified that yet).
We have been getting along fine with llama.cpp serving gpt-oss models with codex cli for our project for a while now.
> that might be resolved (but I have not verified that yet).
@bartlettroscoe have you had an opportunity to check if it’s actually fixed?
I'm facing the same problem, apparently.
> @bartlettroscoe have you had an opportunity to check if it's actually fixed?
@andresssantos, I have not. I was hoping someone else would have time to do it. We are getting by just fine with llama-swap/llama.cpp with gpt-oss-120b for now.
The latest version of the CLI (0.59.0) includes support for LM Studio (in addition to Ollama). See this documentation for configuration details. LM Studio has support for the stateful "responses" endpoint, which was designed for modern reasoning models like gpt-oss, so you may find that you get better results. Let us know what you find.