
Codex with gpt-oss:120b model appears to forget the task

zhangwen0411 opened this issue 3 months ago • 13 comments (status: Open)

What version of Codex is running?

codex-cli 0.21.0

Which model were you using?

gpt-oss:120b

What platform is your computer?

Darwin 24.6.0 arm64 arm

What steps can reproduce the bug?

Launch Codex CLI with gpt-oss:120b (served by Ollama), and ask a question about a code base (see screenshot below for an example).

What is the expected behavior?

My question answered.

What do you see instead?

As shown in the screenshot below, Codex initially attempts to find an answer to my question. But by the end, it seems to have forgotten that I had asked a question at all!

The last two turns show:

thinking We have a huge repo with many imports. The user hasn't asked a question yet. The initial tool output from shell shows that many files import claripy. Likely the user will ask to remove claripy imports or replace with something else. But we need to see the prompt: The user hasn't given a request. Maybe next message will contain a task. We need to wait for user input.

codex I’m ready for your next instruction!

[Image: screenshot of the Codex CLI session]

Additional information

No response

zhangwen0411 · Aug 14 '25

Theory: could the context window that Codex CLI gets from Ollama be too short?

Why do I think that?

If I run gpt-oss:20b, long documents also derail Codex (Ubuntu Linux machine).

"ollama ps" yields the following information regarding the model (loaded by codex upon start):

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    f2b8351c629c    16 GB    100% GPU     8192       4 minutes from now

So it seems the context is only 8192 tokens, which is too small. Is there a way to enforce a larger context?

earl-of-embedding · Aug 14 '25

You're absolutely right! The Ollama FAQ states:

The gpt-oss model has a default context window size of 8192 tokens.

And it gives a few ways to increase the context window size.

Perhaps Codex CLI should issue a warning if Ollama's context window size is too small? The default behavior (with the OSS model loaded by Codex) can confuse users.
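
For other readers, here is a rough sketch of the options the FAQ describes (a sketch only; exact variable and parameter names may differ by Ollama version):

# Option 1: raise the server-wide default context length before starting Ollama
OLLAMA_CONTEXT_LENGTH=32768 ollama serve

# Option 2: set it per session inside `ollama run`, then save the result as a new model
ollama run gpt-oss:120b
>>> /set parameter num_ctx 32768
>>> /save gpt-oss:120b-32k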

zhangwen0411 · Aug 14 '25

I am still scratching my head.

https://github.com/openai/codex/blob/main/codex-rs/core/src/openai_model_info.rs clearly states that 128k should be used for context:

pub(crate) fn get_model_info(model_family: &ModelFamily) -> Option<ModelInfo> {
    let slug = model_family.slug.as_str();
    match slug {
        // OSS models have a 128k shared token pool.
        // Arbitrarily splitting it: 3/4 input context, 1/4 output.
        // https://openai.com/index/gpt-oss-model-card/
        "gpt-oss-20b" => Some(ModelInfo {
            context_window: 96_000,
            max_output_tokens: 32_000,
        }),
        "gpt-oss-120b" => Some(ModelInfo {
            context_window: 96_000,
            max_output_tokens: 32_000,
        }),
        // ...

earl-of-embedding · Aug 14 '25

I briefly looked into the uses of the context_window config parameter. It appears this number is used only in TUI components:

https://github.com/openai/codex/blob/5552688621340c7042a9240a9bedc864242c5ee7/codex-rs/tui/src/chatwidget.rs#L191-L199

https://github.com/openai/codex/blob/5552688621340c7042a9240a9bedc864242c5ee7/codex-rs/tui/src/chatwidget.rs#L733-L740

https://github.com/openai/codex/blob/5552688621340c7042a9240a9bedc864242c5ee7/codex-rs/tui/src/bottom_pane/mod.rs#L223-L232

https://github.com/openai/codex/blob/5552688621340c7042a9240a9bedc864242c5ee7/codex-rs/tui/src/bottom_pane/chat_composer.rs#L129-L140

https://github.com/openai/codex/blob/5552688621340c7042a9240a9bedc864242c5ee7/codex-rs/tui/src/bottom_pane/chat_composer.rs#L672-L687

So it appears the configured context_window is not passed to Ollama, which serves the OSS model.
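
For what it's worth, Ollama's native chat endpoint does accept a per-request context size, so in principle the configured value could be forwarded. A minimal sketch of such a request, based on Ollama's API docs (Codex does not currently send this, and the OpenAI-compatible /v1 endpoint it talks to does not appear to expose the option):

# ask Ollama's native API to use a 96k context for this one request
curl http://localhost:11434/api/chat -d '{
  "model": "gpt-oss:120b",
  "messages": [{ "role": "user", "content": "hello" }],
  "options": { "num_ctx": 96000 }
}'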

zhangwen0411 · Aug 15 '25

You are right. In the meantime, I tried to correct this behavior in my local fork of codex-rs.

With a lot of tinkering I sometimes got "ollama ps" to show the desired context length (I tried both larger and shorter contexts). However, codex-rs is surprisingly persistent: when I use the TUI, it somehow always stops the existing model and replaces it with another one, once more with the default context length of 8k.

That is likely a quick fix for someone seasoned in Ollama, but I am out of my league. :-(

earl-of-embedding · Aug 15 '25

I poked at this for a while with my gpt-oss-20b model. I used Ollama to give it a 32k context, but that still didn't help. It's unclear what makes it "forget": ChatGPT theorized it was a streaming bug in Ollama, but it also "forgets" when I use exec commands.

tunesmith · Aug 19 '25

Curious. When you run "ollama ps" while Codex is operating, does the model always show 32k, or does it sometimes 'switch back' to 8k?

earl-of-embedding · Aug 19 '25

Here's what I ended up doing:

  • On the Ollama side, to set a larger context, I created a Modelfile like this:
    FROM gpt-oss:120b
    PARAMETER num_ctx 131072
    
    I then created a new model in Ollama like this:
    ollama create gpt-oss:120b-128k -f ./Modelfile
    
  • On the Codex CLI side, I added the following to ~/.codex/config.toml:
    [model_providers.ollama]
    name = "Ollama"
    base_url = "http://ollama-server-address:11434/v1"
    
    [profiles.gpt-oss-120b-128k]
    model_provider = "ollama"
    model = "gpt-oss:120b-128k"
    
    I then launch Codex CLI using codex exec --profile gpt-oss-120b-128k.

This seems to be working for me!
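
To double-check that the larger window actually takes effect, you can run ollama ps during a Codex session and look at the CONTEXT column:

ollama ps    # the CONTEXT column should now read 131072 for gpt-oss:120b-128k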

zhangwen0411 · Aug 20 '25

I did something similar with gpt-oss:20b, creating a 32k model, and my request still gets forgotten, as it does with the vanilla codex --oss with ollama serve running. This is on an M3 Pro with 36 GB of memory.

tunesmith · Aug 25 '25

When I use the LM Studio server instead of Ollama, and load gpt-oss-20b there with an expanded context, initial indications from Codex CLI look good; I've had some complete exchanges with changes and explanatory responses. So it's possible this "forgetting" is an Ollama bug, or at least an integration issue between Codex and Ollama that could be improved.

tunesmith · Aug 25 '25

Just for other readers of this thread:

I had success running llama.cpp with the full 128k context for both the 20B and the 120B gpt-oss models (quad 3090 system here). This documentation was very helpful and it works nicely with Codex: https://github.com/ggml-org/llama.cpp/discussions/15396

I could not get Ollama to work; inference became very slow for >40k context and it moved everything to the CPU.

PS: for the sake of completeness, these commands work on my quad 3090 system:

./llama-server -hf ggml-org/gpt-oss-20b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -ngl 99 -fa
./llama-server -hf ggml-org/gpt-oss-120b-GGUF --ctx-size 0 --jinja -ub 2048 -b 2048 -ngl 99 -fa

Both(!) result in slightly above 60 tokens per second and give the full context. For large contexts the tps count drops significantly, I think to about 15 tps for very long contexts.

earl-of-embedding · Aug 26 '25

@zhangwen0411 Thanks for providing the workaround. There is also a simpler way to do it. Run ollama show:

ollama show gpt-oss:20b
  Model
    architecture        gptoss
    parameters          20.9B
    context length      131072
    embedding length    2880
    quantization        MXFP4

  Capabilities
    completion
    tools
    thinking

  Parameters
    temperature    1

  License
    Apache License
    Version 2.0, January 2004
    ...

Note: the context length field shows the model's upper limit. I found for myself that running LLMs at full context on my MacBook Pro (M3 Pro, 36 GB of RAM) makes them ultra-slow and starves my Mac of RAM, given that I am also running JetBrains IDEs:

ollama ps
NAME                ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:128k-20b    2d8df491533b    34 GB    16%/84% CPU/GPU    131072     4 minutes from now

Then run ollama run gpt-oss:20b from your terminal:

ollama run gpt-oss:20b
>>> /?
Available Commands:
  /set            Set session variables
  /show           Show model information
  /load <model>   Load a session or model
  /save <model>   Save your current session
  /clear          Clear session context
  /bye            Exit
  /?, /help       Help for a command
  /? shortcuts    Help for keyboard shortcuts

Use """ to begin a multi-line message.

>>> /set parameter
Available Parameters:
  /set parameter seed <int>             Random number seed
  /set parameter num_predict <int>      Max number of tokens to predict
  /set parameter top_k <int>            Pick from top k num of tokens
  /set parameter top_p <float>          Pick token based on sum of probabilities
  /set parameter min_p <float>          Pick token based on top token probability * min_p
  /set parameter num_ctx <int>          Set the context size
  /set parameter temperature <float>    Set creativity level
  /set parameter repeat_penalty <float> How strongly to penalize repetitions
  /set parameter repeat_last_n <int>    Set how far back to look for repetitions
  /set parameter num_gpu <int>          The number of layers to send to the GPU
  /set parameter stop <string> <string> ...   Set the stop parameters

>>> /set parameter num_ctx 31072
Set parameter 'num_ctx' to '31072'
>>> /save gpt-oss:32k-20b
Created new model 'gpt-oss:32k-20b'

When running the 32k model my memory usage is acceptable:

NAME               ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:32k-20b    e83f14b6f6fb    18 GB    100% GPU     31072      4 minutes from now

After running the command that sets the context length, I need to save it as a separate model; it then runs with this context length regardless of environment. It works like a charm for me and is simpler than creating a separate Modelfile.

Mondonno · Aug 29 '25

Hi, has anyone gotten it to work successfully with vLLM?

jayteaftw · Aug 29 '25

@jayteaftw I think on vLLM the issue might be that not all endpoints support streaming: https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#harmony-format-support. I've tried both the Responses API and the Chat API, but alas I can't get it to work with function calling. I can also see that the request came through; it's just that nothing was generated:

(APIServer pid=1) INFO 09-17 13:03:06 [api_server.py:1545] response_body={streaming_complete: no_content, chunks=47}                                                     
(APIServer pid=1) INFO 09-17 13:03:12 [loggers.py:123] Engine 000: Avg prompt throughput: 767.7 tokens/s, Avg generation throughput: 5.7 tokens/s, Running: 0 reqs, Waiti
(APIServer pid=1) INFO 09-17 13:03:22 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting

Here is a relevant thread on Reddit: https://www.reddit.com/r/LocalLLaMA/comments/1n0h6l7/gptoss_tools_calling_is_not_working_with_vllm/ It seems that people had luck with the Responses API (presumably without streaming).

Hnatekmar · Sep 17 '25

Hi, has anyone gotten it to work successfully with vLLM?

@jayteaftw, not me. I have been trying to use the vLLM container with the gpt-oss-20b model and tool calling is just not working at all. You can run prompts like "What are you?" and it will call the model and respond. But as soon as you ask it to run a command, Codex just does nothing and returns no output. Things work fine with an updated llama.cpp serving gpt-oss models with Codex CLI. The problem seems to be with vLLM itself, not Codex:

  • https://github.com/openai/codex/issues/2565#issuecomment-3240190654


Codex and vLLM are not shaking hands correctly somehow.

bartlettroscoe · Sep 19 '25

Since this also seems to be a problem with the gpt-oss-20b (gpt-oss:20b) model, can we update the title of this issue from:

Codex with gpt-oss:120b model appears to forget the task

to

Codex with gpt-oss models appears to forget the task

?

bartlettroscoe · Sep 22 '25

From https://github.com/openai/codex/issues/2257#issuecomment-3319825535

I think it is vLLM. In vLLM's guide for deploying GPT-OSS it shows that the Response API with streaming does not support function calling. So if Codex is calling that API path, then I think that would be the problem

Is there a way to configure Codex to use the batch (non-streaming) "Response API" mode so we can use the vLLM-served gpt-oss-120b model? The "Response API" seems to be supported by vLLM as per:

  • https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#harmony-format-support

How did OpenAI not make sure that vLLM was updated to serve the gpt-oss-120b model in a way that Codex CLI could use it? Yes, you can run the gpt-oss-20b model with llama.cpp very well (even excellently), but to seriously serve the gpt-oss-120b model you need a serious GPU like an H100, where llama.cpp does not seem to work (see https://github.com/ggml-org/llama.cpp/issues/15112), so you really have to serve this model with vLLM. We have this great model and we can't use it for anything real. How did this happen?
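
One thing I may try: I don't know of a switch for a non-streaming Responses mode, but as far as I can tell Codex's provider config in ~/.codex/config.toml has a wire_api setting that selects between the "responses" and "chat" wire formats, so pointing a profile at vLLM's Chat Completions endpoint might be worth a try. A sketch only (untested; the server address and model name below are placeholders):

cat >> ~/.codex/config.toml <<'EOF'
# hypothetical vLLM provider using the Chat Completions wire format
[model_providers.vllm]
name = "vLLM"
base_url = "http://vllm-server:8000/v1"
wire_api = "chat"

[profiles.gpt-oss-120b-vllm]
model_provider = "vllm"
model = "openai/gpt-oss-120b"
EOF

codex exec --profile gpt-oss-120b-vllm "summarize the layout of this repo"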

bartlettroscoe · Sep 23 '25

@bartlettroscoe I haven't found any; then again, I haven't had much time to experiment.

Maybe it could work with SGLang, but then again I haven't tested it yet...

Also, llama.cpp is limited with concurrent requests (AFAIK it supports them, but you need to split the context size by the number of parallel requests), so vLLM would be better for a production deployment (if it worked, that is).

Hnatekmar · Sep 23 '25

Also, llama.cpp is limited with concurrent requests (AFAIK it supports them, but you need to split the context size by the number of parallel requests), so vLLM would be better for a production deployment (if it worked, that is).

FYI: I am looking at flexllama as a possible option to serve gpt-oss-120b on H100s to a handful of customers (assuming we can get llama.cpp to serve the gpt-oss-120b model on our H100s).

bartlettroscoe · Sep 23 '25

+1, I have the same behaviour with gpt-oss + the latest release of vLLM + LiteLLM on 2x H100. I have 128k context.

dotmobo · Oct 05 '25

Same behavior with gpt-oss-120b on SGLang (hosted on our 2x RTX 6000 Ada server) with a 64k context window.

edv-sml · Oct 06 '25

+1

thomasWos · Oct 06 '25

FYI: We have a successful demo using llama-swap to serve the gpt-oss-120b model on multiple GPUs (H100s) from a single endpoint IP:port. That is good enough for our research and benchmarking use cases. But this will not work for our customers who are using vLLM.

Anyone working on this?

bartlettroscoe · Oct 08 '25

Also observing this issue in Codex CLI when trying to access gpt-oss-120b running in vLLM (on a separate NVIDIA DGX Spark device). Other client tools like Cline and cline-cli have no trouble accessing this same model hosted by vLLM and using tools, etc., so it is not the vLLM configuration. Cline is a little slanted toward Anthropic conventions, though, so I was hoping for better tool use from OpenAI's own tool talking to their own open-source model, but so far no luck! The most I can get out of Codex with the vLLM model is a short answer without context (like "short poem on python"); then it forgets everything....

cboettig · Nov 07 '25

Other client tools like Cline and cline-cli have no trouble accessing this same model hosted by vLLM and using tools, etc.

@cboettig, those agents may be using the batch, non-streaming "Response API" for vLLM gpt-oss, which is documented to work at:

  • https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#harmony-format-support

The issue is that vLLM did not support the "Response API with Streaming", which is what Codex CLI seems to be using.

However, with the merge of this vLLM PR just a couple of days ago:

  • https://github.com/vllm-project/vllm/pull/26874

that might be resolved (but I have not verified that yet).

We have been getting along fine for a long time with llama.cpp serving gpt-oss models to Codex CLI for our project.
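
If someone wants to check whether streaming tool calls now come through, a rough probe of vLLM's streaming Responses endpoint might look like this (a sketch assuming vLLM follows the OpenAI Responses schema; the server address, model id, and tool definition are placeholders):

# one streaming Responses request that offers a single "shell" tool
curl http://vllm-server:8000/v1/responses -H 'Content-Type: application/json' -d '{
  "model": "openai/gpt-oss-120b",
  "input": "List the files in the current directory.",
  "stream": true,
  "tools": [{
    "type": "function",
    "name": "shell",
    "description": "Run a shell command",
    "parameters": {
      "type": "object",
      "properties": { "command": { "type": "string" } },
      "required": ["command"]
    }
  }]
}'

If streaming function calling works, the event stream should contain function-call items rather than ending with no output.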

bartlettroscoe · Nov 08 '25

that might be resolved (but I have not verified that yet).

@bartlettroscoe have you had an opportunity to check if it’s actually fixed?

I'm facing the same problem, apparently.

andresssantos · Nov 19 '25

@bartlettroscoe have you had an opportunity to check if it’s actually fixed?

@andresssantos, I have not. I was hoping someone else would have time to do it. We are getting by just fine with llama-swap/llama.cpp with gpt-oss-120b for now.

bartlettroscoe · Nov 19 '25

The latest version of the CLI (0.59.0) includes support for LM Studio (in addition to Ollama). See this documentation for configuration details. LM Studio has support for the stateful "responses" endpoint, which was designed for modern reasoning models like gpt-oss, so you may find that you get better results. Let us know what you find.

etraut-openai · Nov 19 '25