30k tokens per minute with Tier 1 API key

Open ReyNeill opened this issue 3 months ago • 37 comments

What version of Codex is running?

latest

Which model were you using?

gpt-5

What platform is your computer?

macOS

What steps can reproduce the bug?

  • Start any conversation with gpt-5 with an API key of a Tier 1 verified organization

Issue

It's literally impossible to use gpt-5 on the CLI with a Tier 1 account:

"🖐 stream disconnected before completion: Request too large for gpt-5 in organization org-IwwmbIkguifYieViu0oMJ8Uv on tokens per min (TPM): Limit 30000, Requested 30237. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/ account/rate-limits to learn more."

This MUST be unintended. If it's not, then why would a user bother consuming tokens via the API at all if they can't use GPT-5, the objectively best and latest model? Especially since GPT-5's pricing and inference efficiency are outstanding. I'm still paying for my tokens!

Maybe it was just never considered or revisited, and the rate limits need to be updated. Thanks for your attention to this matter.

ReyNeill avatar Aug 23 '25 22:08 ReyNeill

Same issue. Waiting 10 minutes just to see that the tokens I paid for were not enough, then having to start over and pray that this time my money won't be flushed down the toilet.

Just to clarify: even a simple task like "remove button XXX from YYYY.php file" is taking up 30,000+ tokens.

Wydoyolo avatar Aug 24 '25 08:08 Wydoyolo

I also received this response:

Image

And I can confirm that signing in via ChatGPT does not work with API keys; rate limiting there is applied based on the subscription. A lot of potential new customers are being lost because of this.

ReyNeill avatar Aug 24 '25 20:08 ReyNeill

can you make sure you're on the latest version, then tell me what you see in /status after signing in with chatgpt on a plus/pro/team account? you shouldn't see that limit

ae-openai avatar Aug 24 '25 21:08 ae-openai

@ae-openai They want to use the API instead of subscription login, but since they created a new org they only have a Tier-1 API key. That means they’re limited to 30k tokens per minute, and in turn limited to 30k in a single request. Unless you have a Tier-3 or Tier-4 org, using Codex for coding is basically impossible.

mrqs001 avatar Aug 24 '25 23:08 mrqs001

Yes, I can't do even the simplest task with Codex as a Tier 1 user. 30k tokens is just not enough.

stef4ik avatar Aug 25 '25 00:08 stef4ik

> can you make sure you're on the latest version, then tell me what you see in /status after signing in with chatgpt on a plus/pro/team account? you shouldn't see that limit @ae-openai

This is in regards to tier 1 organization API key usage (per the title of the issue), NOT for a plus/pro/team account.

@ae-openai this has been an issue going back to at least April 16, 2025. At least a hundred users have repeatedly reported that Codex suddenly crashes when it exceeds organization-specific tokens-per-minute (TPM) limits (30,000 TPM for GPT‑5 for tier 1 organizations via the API) without retrying or preserving session context.

There’s a long, long trail of rate-limit/crash reports. Nearly all are closed, often without clear remediation notes, and always with users reporting that this has not been resolved: https://github.com/openai/codex/issues/88, https://github.com/openai/codex/issues/97, https://github.com/openai/codex/issues/137, https://github.com/openai/codex/issues/157, https://github.com/openai/codex/issues/233, https://github.com/openai/codex/issues/246, https://github.com/openai/codex/issues/303, https://github.com/openai/codex/issues/326, https://github.com/openai/codex/issues/334, https://github.com/openai/codex/issues/353, https://github.com/openai/codex/issues/451, https://github.com/openai/codex/issues/553, https://github.com/openai/codex/issues/690, https://github.com/openai/codex/issues/691, https://github.com/openai/codex/issues/1115, https://github.com/openai/codex/issues/2022, https://github.com/openai/codex/issues/2131, and https://github.com/openai/codex/issues/2448.

The pattern in every one of them: when a Tier 1 organization uses Codex CLI via the API, Codex CLI crashes when hitting rate limits instead of honoring retry-after instructions or preserving session context.

Maintainers like @pakrym-oai have closed every single one of these issues without additional explanation, follow-up, or acknowledgment that the issue persists, despite continued user reports. Users have made the problem abundantly clear in the dozens of published issues and in their responses to the maintainers. I like to give the benefit of the doubt, but I am having trouble understanding how this could be anything but gross negligence, unless it is a hidden policy choice from upper leadership that has purposefully been obscured from users.

This pattern is deeply troubling and erodes trust, not just in Codex CLI itself, but in the integrity of OpenAI’s support for its own developer tools. Could leadership clarify whether this is being actively addressed, or what priorities or constraints are behind repeatedly closing unresolved issues? Assurance that maintainers’ decisions are not intentional dismissal or silent ignoring of legitimate user concerns is essential for developers’ confidence in the platform.

@ae-openai, we need explicit clarity and transparency on this.

@tibo-openai @wojzaremba @codex-maintainers

https://github.com/openai/codex/issues/2149

mdbro avatar Aug 25 '25 05:08 mdbro

Started using Codex with an API key, and all I have been able to do is /init. Everything else bails out with "request too large" TPM errors. I was hoping Codex could be the answer to some problems I have had with a similar 'Code' competitor.

digitalmagnets avatar Aug 25 '25 08:08 digitalmagnets

⚠️ Problem: Context Bloat (aka Context Window Explosion) in Codex/LLMs

Suppose each file has ~10k tokens.
There are two possible patterns for how context grows:


1. Sequential (linear cost)

If you only load the current file (or a snapshot/summary of it) at each step:

  • File 1 → context = 10k, cumulative usage = 10k
  • File 2 → context = 10k, cumulative usage = 20k
  • File 3 → context = 10k, cumulative usage = 30k
  • File 4 → context = 10k, cumulative usage = 40k

➡️ Per query: constant 10k
➡️ Across queries: total = 10k·n → O(n)

Therefore, when a model needs to analyze multiple files at once, it’s better to load all four in parallel to avoid duplication and then return a consolidated snapshot to the main context. If the max context limit is a concern, each file can be processed in a separate chat and the four snapshots merged afterward.

Or you could use a side agent to read even 10 files of 100k tokens each, but do it iteratively by loading only the current 10k at a time and storing snapshots in the main context about where everything is located. In that case, even a 10k token window would be enough to read large files, and at the same time the token usage would drop dramatically.
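To make that side-agent idea concrete, here is a minimal, self-contained sketch (my own illustration, not anything Codex does today); `count_tokens` and `summarize` are crude stand-ins for a real tokenizer and a cheap summarization call:

```python
# Sketch of iterative reading: process a large file in ~10k-token chunks,
# keep only short snapshots, and never hold the raw file in the main context.

def count_tokens(text: str) -> int:
    # Rough stand-in for a real tokenizer (e.g. tiktoken): ~4 chars per token.
    return max(1, len(text) // 4)

def summarize(chunk: str) -> str:
    # Hypothetical helper: in a real agent this would be a cheap model call
    # returning a short snapshot ("lines A-B: defines class Foo, ...").
    return chunk[:200] + " ..."

def read_iteratively(path: str, chunk_tokens: int = 10_000) -> list[str]:
    snapshots, buffer, buffered = [], [], 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            buffer.append(line)
            buffered += count_tokens(line)
            if buffered >= chunk_tokens:
                snapshots.append(summarize("".join(buffer)))  # keep the snapshot
                buffer, buffered = [], 0                      # drop the raw chunk
    if buffer:
        snapshots.append(summarize("".join(buffer)))
    return snapshots  # the main context only ever sees these short snapshots
```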


2. Cumulative (quadratic cost – Gauss triangle)

If you always reload all previous files together with the new one:

  • File 1 → context = 10k, cumulative usage = 10k
  • File 2 → context = 20k (file1 + file2), cumulative usage = 10k + 20k = 30k
  • File 3 → context = 30k (file1 + file2 + file3), cumulative usage = 10k + 20k + 30k = 60k
  • File 4 → context = 40k, cumulative usage = 10k + 20k + 30k + 40k = 100k

This likely explains the large discrepancy between cache tokens and fresh input tokens — caused by repeated reloading of previous context. In practice, only ~40k tokens were newly read, but the cache registers ~100k due to accumulated reuse. Maybe there's also an issue with the cache itself under the hood — like it keeps buffering the same stuff multiple times and falls into a Gauss triangle pattern.

Linear loading scales linearly (~1M tokens for 100 files at 10k each), while cumulative reloading explodes quadratically (~50M tokens), the classic Gauss triangle effect.

General formula:

tokens_total(n) = 10k · (1 + 2 + … + n) = 10k · n(n+1)/2

➡️ Per query: grows linearly → O(n)
➡️ Across queries: grows quadratically → O(n²) (Gauss triangle)
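For illustration, a tiny self-contained simulation of the two patterns (my own numbers, matching the 10k-per-file assumption above):

```python
# Compare total token usage for n files of 10k tokens each:
# sequential = send only the current file; cumulative = resend everything so far.

FILE_TOKENS = 10_000

def sequential_total(n_files: int) -> int:
    return FILE_TOKENS * n_files                                 # O(n)

def cumulative_total(n_files: int) -> int:
    return sum(FILE_TOKENS * i for i in range(1, n_files + 1))   # O(n^2), Gauss triangle

for n in (4, 100):
    print(n, sequential_total(n), cumulative_total(n))
# 4   -> 40000   vs 100000
# 100 -> 1000000 vs 50500000   (~1M vs ~50M, as stated above)
```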


✅ Better approach: Iterative Summarization + Retrieval

  • Instead of carrying entire files in context every time, generate summaries or snapshots per file.
  • At query time, only load the minimal, relevant snapshots into context.
  • Optionally, leverage auxiliary models for preprocessing/parallel retrieval to reduce main context usage.
  • Optionally, store embeddings in a small local vector DB (indexed).
    → Retrieve fragments on demand without quadratic token growth.

Possible mitigations include: RAG (Retrieval Augmented Generation) for selective loading, hierarchical summarization (summary → detail on demand), and sliding windows to keep only the last N elements in context.
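A bare-bones sketch of that pattern, with naive keyword overlap standing in for embeddings or a vector DB (illustrative only, not a real Codex feature):

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    path: str
    summary: str            # short per-file snapshot, a few hundred tokens at most
    keywords: set[str] = field(default_factory=set)

def build_index(files: dict[str, str]) -> list[Snapshot]:
    # One-time pass: reduce each file to a small snapshot plus crude keywords.
    return [
        Snapshot(path, text[:300], {w.lower() for w in text.split() if len(w) > 3})
        for path, text in files.items()
    ]

def retrieve(index: list[Snapshot], query: str, top_k: int = 2) -> list[Snapshot]:
    q = {w.lower() for w in query.split()}
    return sorted(index, key=lambda s: len(q & s.keywords), reverse=True)[:top_k]

def build_prompt(query: str, index: list[Snapshot]) -> str:
    # Only the minimal, relevant snapshots enter the context, never whole files.
    parts = [f"[{s.path}]\n{s.summary}" for s in retrieve(index, query)]
    return "\n\n".join(parts) + f"\n\nTask: {query}"
```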

💡 In practice:

  • If you design the pipeline well, you stay in the linear regime.
  • If you naïvely reload everything each time, you hit the quadratic Gauss triangle explosion.

I guess the 30k limit comes precisely from this: during the agent era, such behavior from many users would overload the API, since even when you just write something short like ‘ok’ about 10 times, the client is still re-sending the same 200k tokens each time — which is easy to hit if you load entire files without any control.

This isn’t a direct API fix, but a design pattern to avoid quadratic context explosion and use the token window efficiently. A real out-of-the-box solution would require changes in the overall Codex architecture — or designing your own agent from scratch.

I’m not sure if I’ve chosen the right topic, but the above addresses the problem of exceeding the maximum context as well as the issue of the limit being used up too quickly on Plus and Pro accounts.

PS. It seems like an incredible coincidence that the problem was suddenly found — and the excessive token counting turned out to be 7–9× higher than expected, of course 👎 :)

AdsQnn avatar Aug 25 '25 17:08 AdsQnn

Likely related to #2697; #2698 probably closes this issue.

snipeship avatar Aug 27 '25 00:08 snipeship

@snipeship Exactly the same thing with the assistant’s repeated message I noticed yesterday.

AdsQnn avatar Aug 27 '25 01:08 AdsQnn

> @snipeship Exactly the same thing with the assistant’s repeated message I noticed yesterday.

Consider giving the issue & PR a thumbs up

snipeship avatar Aug 27 '25 01:08 snipeship

I’ll try to sort something out if no one beats me to it :) I’m juggling a few projects at the moment, so we’ll see how it goes :)

AdsQnn avatar Aug 27 '25 01:08 AdsQnn

> I’ll try to sort something out if no one beats me to it :) I’m juggling a few projects at the moment, so we’ll see how it goes :)

#2698 beat you to it

snipeship avatar Aug 27 '25 01:08 snipeship

> I’ll try to sort something out if no one beats me to it :) I’m juggling a few projects at the moment, so we’ll see how it goes :)
>
> #2698 beat you to it

It’s great that the problem got solved. I’ve got a few interesting ideas for Codex, so maybe something will come up :) I’m glad I could take part in solving the issue, even in a small way. Now that Codex is all good – happy programming! Cheers :)

AdsQnn avatar Aug 27 '25 01:08 AdsQnn

@AdsQnn @snipeship Unfortunately, this specific issue has not been resolved.

I just tested in 0.25.0, so it looks like this specific issue was not caused by the duplication error after all.

_ You are using OpenAI Codex in ~/github/***

/approvals - choose what Codex can do without approval
/model - choose what model and reasoning effort to use

• Approval Mode: never
• Sandbox: read-only
• AGENTS files: AGENTS.md

🧠 Model
• Name: gpt-5
• Provider: OpenAI
• Reasoning Effort: Medium
• Reasoning Summaries: Auto

📊 Token Usage
• Session ID: ***
• Input: 0
• Output: 0
• Total: 0

user review and summarize codebase

📋 Update plan [░░░░░░░░░░] 0/7
└ □ Scan repo structure and files
  □ Identify app entry points
  □ Skim core modules
  □ Skim services implementations
  □ Skim GUI layer
  □ Skim tests and fixtures
  □ Draft concise summary

codex I’ll scan the repo to map structure and key files.

_ ✓ 🔎 **

codex I’ll list top-level folders and key subfolders to map structure.

_ ✓ 📂 ls -1 ✓ 🔎 .

codex I’ll open README.md and key run scripts to understand usage.

_ ✓ 📖 README.md ⚠  stream error: stream disconnected before completion: Rate limit reached for gpt-5 in organization org-PTmtBZRilMBenZBhKRsepDTZ on tokens per min (TPM): Limit 30000, Used 28328, Requested 13107. Please try again in 22.87s. Visit https://platform.openai.com/account/rate-limits to learn more.; retrying 1/5 in 199ms…

⚠  stream error: stream disconnected before completion: Rate limit reached for gpt-5 in organization org-PTmtBZRilMBenZBhKRsepDTZ on tokens per min (TPM): Limit 30000, Used 27817, Requested 13107. Please try again in 21.848s. Visit https://platform.openai.com/account/rate-limits to learn more.; retrying 2/5 in 434ms…

⚠  stream error: stream disconnected before completion: Rate limit reached for gpt-5 in organization org-PTmtBZRilMBenZBhKRsepDTZ on tokens per min (TPM): Limit 30000, Used 27018, Requested 13107. Please try again in 20.25s. Visit https://platform.openai.com/account/rate-limits to learn more.; retrying 3/5 in 848ms…

⚠  stream error: stream disconnected before completion: Rate limit reached for gpt-5 in organization org-PTmtBZRilMBenZBhKRsepDTZ on tokens per min (TPM): Limit 30000, Used 26059, Requested 13107. Please try again in 18.332s. Visit https://platform.openai.com/account/rate-limits to learn more.; retrying 4/5 in 1.486s…

⚠  stream error: stream disconnected before completion: Rate limit reached for gpt-5 in organization org-PTmtBZRilMBenZBhKRsepDTZ on tokens per min (TPM): Limit 30000, Used 24664, Requested 13107. Please try again in 15.542s. Visit https://platform.openai.com/account/rate-limits to learn more.; retrying 5/5 in 3.117s…

🖐  stream disconnected before completion: Rate limit reached for gpt-5 in organization org-PTmtBZRilMBenZBhKRsepDTZ on tokens per min (TPM): Limit 30000, Used 22468, Requested 13107. Please try again in 11.15s. Visit https://platform.openai.com/account/rate-limits to learn more.

▌ Summarize recent commits ⏎ send Ctrl+J newline Ctrl+T transcript Ctrl+C quit 5920 tokens used 98% context left
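For what it's worth, the server is suggesting waits of roughly 15-23 seconds while the CLI retries within a few hundred milliseconds, so all five attempts land in the same exhausted minute bucket. A rough sketch of client-side backoff that honors the hint instead (illustrative only, not Codex's actual retry code; `send` is a hypothetical callable that raises with the rate-limit text):

```python
import re
import time

def suggested_delay(message: str) -> float | None:
    # Parses e.g. "Please try again in 22.87s" or "... in 850ms" from the error text.
    m = re.search(r"try again in ([0-9.]+)\s*(ms|s)", message)
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2)
    return value / 1000 if unit == "ms" else value

def call_with_backoff(send, max_attempts: int = 5):
    # `send` is a hypothetical callable that raises RuntimeError carrying the
    # server's rate-limit message when the request is rejected.
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except RuntimeError as err:
            if attempt == max_attempts:
                raise
            delay = suggested_delay(str(err)) or min(2 ** attempt, 30)
            time.sleep(delay)  # wait out the minute bucket instead of ~200ms
```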

mdbro avatar Aug 29 '25 03:08 mdbro

@mdbro

I partially checked: Codex claims it has snapshots. However, when you have 28k tokens in context and try to load another ~14k, you get an error.

Besides, it's clear that context trimming is either configured higher up in the CLI (I haven't checked whether it exists there at all) or is happening on their side, because you can clearly see that it doesn't trim down to 30k, which is the direct cause of this problem.

This would require adding a max content setting to the CLI.

On X it was mentioned that, for now, a workaround has been applied. As I said, with such a small context window you’d really need your own agent.

The reason this context window is so limited is exactly what I described above. Overall, 30k is very little.

It looks like on their side they’re applying a workaround by creating snapshots — but not inside the CLI. When you call the API, you don’t hit that workaround, so you end up loading full files. Try uploading two small files for testing, and then ask the model whether it has the entire files in context, or just snapshots.

So the minimal design pattern to solve the problem should look as follows: agent_main, agent_get — which iteratively reloads only the relevant context, and with each iteration creates snapshots and returns them to the main context. Then, a sliding window set by max-content for both. All of this implemented directly in the CLI.

I would also consider adding an agent_set, which would allow for more granular and selective modifications — this, in turn, would lead to even greater savings of tokens in the main context.

Keep in mind that in such a setup, agent_set should have access to agent_get, so that reads don't need to be routed through the main context again.
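A rough sketch of how that agent_main / agent_get / agent_set split could look (purely illustrative; none of this exists in Codex CLI today, and the token counter is a crude stand-in):

```python
MAX_CONTENT = 30_000  # sliding-window budget for the main context, in tokens

def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude estimate; a real CLI would use a tokenizer

class AgentGet:
    """Reads files on demand and returns only short snapshots."""
    def read(self, path: str) -> str:
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
        # A real version would summarize chunk by chunk; here we just truncate.
        return f"[snapshot of {path}] {text[:400]}"

class AgentSet:
    """Applies targeted edits without routing file contents through the main context."""
    def __init__(self, getter: AgentGet):
        self.getter = getter  # agent_set can consult agent_get directly
    def replace(self, path: str, old: str, new: str) -> None:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        with open(path, "w", encoding="utf-8") as f:
            f.write(text.replace(old, new, 1))

class AgentMain:
    """Holds the conversation and trims the oldest items once over MAX_CONTENT."""
    def __init__(self) -> None:
        self.context: list[str] = []
    def add(self, item: str) -> None:
        self.context.append(item)
        while sum(count_tokens(c) for c in self.context) > MAX_CONTENT:
            self.context.pop(0)  # sliding window: drop the oldest snapshot first
```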

AdsQnn avatar Aug 29 '25 10:08 AdsQnn

@AdsQnn Thanks for clarifying the “context bloat” mechanics. I completely agree that Codex should never be pushing 28k + 14k tokens into a single Tier-1 request, but one point appears to be mixed up: the 30k figure is a rate cap, not a context-window cap. GPT-5 accepts roughly 400k tokens in a single request. Tier-1 orgs may send or receive only 30k tokens per minute across all requests.

When Codex sends 28k tokens of history and then asks for a 14k-token completion, it does not exceed the model’s window. It’s blowing through the minute-bucket, so the call is killed even before inference finishes.

As @AdsQnn pointed out, adding a max-content option that caps the prompt so the prompt + completion total never exceeds the 30k-TPM budget looks like a practical fix.
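As a rough sketch of what such a cap could do (the option name and numbers are hypothetical, and the token counter is only an estimate), the CLI would drop the oldest history turns until the request fits the per-minute budget:

```python
TPM_BUDGET = 30_000          # Tier-1 tokens-per-minute limit
COMPLETION_RESERVE = 8_000   # illustrative headroom reserved for the reply

def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude estimate; a real client would use a tokenizer

def trim_history(system: str, history: list[str], user: str) -> list[str]:
    # Keep the most recent turns that still fit under the budget.
    budget = TPM_BUDGET - COMPLETION_RESERVE - count_tokens(system) - count_tokens(user)
    kept: list[str] = []
    used = 0
    for turn in reversed(history):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))
```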

v0.25.0 still crashes for Tier-1 orgs; no official acknowledgement of a roadmap or design change. The problem stems from Codex CLI’s prompt-construction strategy, not from the GPT-5 model itself, and it remains unresolved despite a long paper trail.

Maintainers (@ae-openai, @pakrym-oai, @codex-maintainers):

  1. Ownership – Will Codex CLI add automatic prompt-trimming, or must every Tier-1 user build an external agent?
  2. Timeline – A clear ETA or roadmap item would stop the seven-month loop of duplicate tickets.

mdbro avatar Aug 29 '25 18:08 mdbro

@mdbro Ah, I thought the API had a limit of 30k. I was mistaken—I didn’t check carefully. That makes things even more difficult, because just two iterations of 15k within 10 seconds are enough, and that’s it. In that case, the only solution is raising the limit—otherwise, it’s impossible to work with it this way.

With a context of over 200k, it’s completely impossible to work if you’re limited to 30k per minute.

AdsQnn avatar Aug 29 '25 21:08 AdsQnn

Waiting on guidance from maintainers (@ae-openai, @pakrym-oai, @codex-maintainers). I see three options before this issue can be considered closed:

  1. Revise the docs to specify that Codex CLI is not compatible with the OpenAI API for Tier 1 organizations. OpenAI currently promotes Codex CLI on its official website alongside their API pricing and purchase options, and Codex CLI is sponsored/owned by OpenAI in the sense that the repo, releases, and maintainers are theirs: "We recommend using Codex with GPT-5, our best coding model." (OpenAI) Given this, it would be very reasonable to expect OpenAI to issue refunds to Tier 1 organizations for recent API purchases if OpenAI decides to end support for Tier 1 organizations in Codex CLI.
  2. Increase GPT-5 TPM rates for Tier 1 organizations. I understand that OpenAI does need to allocate scarce GPU compute preferentially to high-paying customers for business reasons, but I doubt it is their intention to make the product completely unusable for lower-tier orgs.
  3. Some other workaround in Codex CLI that has not been considered yet.

The bottom line is that we need clarity on this.

mdbro avatar Aug 29 '25 21:08 mdbro

I think that increasing the per-minute limit while reducing the context window (to prevent overload) for tier 1 would be a better option and would allow Codex to be used through the API, because right now it’s unrealistic. That said, the CLI would still require the improvements I mentioned above, due to the limited context.

AdsQnn avatar Aug 29 '25 22:08 AdsQnn

Clarifying scope so this does not get closed for the wrong reason

This is not a context-window overflow. GPT-5 supports ~400K tokens per request. The blocker is the Tier-1 API rate cap of 30K tokens per minute. Codex CLI still sends >30K/minute, so runs fail mid-stream. Reproduced in v0.27.0. There is new retry/backoff logic in v0.27.0, but without reducing request size, all retries still fail.

Out of scope for this bug

  • Reducing model context size (irrelevant because this would not reduce what the CLI sends, and the actual limit hit is throughput per minute, not per-request capacity).
  • ChatGPT Plus/Pro sign-in (irrelevant to API access).

Maintainers (@ae-openai @pakrym-oai @codex-maintainers): please confirm ownership of this fix.

Credit: @AdsQnn’s snapshot/trim idea is useful, but the failure mode is strictly TPM rate-limiting, not context.

mdbro avatar Aug 30 '25 18:08 mdbro

@mdbro I described both above :) Both are true: you can't push more than 30k tokens at once if the limit is 30k per minute, so without trimming it won't work. By "context limitation" I mean preventing overload on OpenAI's side. I admit I didn't pay close attention to this earlier. But still, it doesn't change the fact that in order not to restrict context, you'd need at least 500k per minute.

What does raising the per-minute limit to 128k actually give you, if you’re trying to load a context of over 200k? In order for it to actually work, the per-minute limit has to be at least twice as large as the context window.

AdsQnn avatar Aug 30 '25 18:08 AdsQnn

@AdsQnn Thanks for engaging on this. I really appreciate the follow up. To keep the terms straight:

The issue is not context, it’s throughput. Codex is sending ~20-40K input tokens and then requesting output, so prompt + completion blows past the 30K TPM bucket even though GPT-5’s ~400K context window isn’t being hit.

On the 128K question: raising the TPM cap to 128k/min would let a single 20-40K prompt + 10-15K completion call complete, because 30K → 128K removes the per minute choke. Today those same calls are failing only because 30K/min is too small, not because of anything to do with context window size.

TPM and context window are independent. You don’t need TPM to be “twice the context window.” You need TPM > (prompt + expected completion) if you want the call to finish within a minute. The “2x window” claim assumes you fill the full context window every call and also expect an equally large completion, which is not how the CLI typically operates.

As a sanity check: other providers have very large windows (e.g., 1M) with much smaller TPM caps (Gemini CLI). That alone shows the two limits are decoupled.

mdbro avatar Aug 30 '25 19:08 mdbro

"On the 128K question: raising the TPM cap to 128k/min would let a single 20-40K prompt + 10-15K completion call complete, because 30K → 128K removes the per minute choke. Today those same calls are failing only because 30K/min is too small, not because of anything to do with context window size." @mdbro Unfortunately, you’re completely wrong. I don’t want to criticize what you’re saying, but think carefully — where does your assumption come from that the main context should be 30k and not 200k? Please reflect on what you’re writing, because it’s turning into a lot of unnecessary spam. How do you expect to load a 200k context window if the per-minute limit is only 128k? To ensure the context doesn’t exceed 30k, there must be some kind of restriction, right?

What, in your opinion, makes the main context end up being 20–40k instead of 200k, and the iterations 10–15k instead of 40–60k?

If, in the current configuration, you increased the TPM to 128k, it would allow you to ask 5 tasks, and you’d end up with exactly the same as now. Even after waiting an hour, nothing would change, because you’d still have more in context than the TPM allows. On top of that, as I said, the API most likely doesn’t use any workaround for loading entire files — and that would completely wreck the whole thing anyway.

To sum up, in order to work effectively this way, the TPM cannot be lower than the context window — because sooner or later you’ll inevitably hit that limit. The TPM must be larger than the maximum context, and ideally at least twice as much.

TPM – tokens per minute limit (e.g., 100k).

Context – the sum of all input tokens you actually send (max 200k+).

Out – response tokens.


Rules

  1. For a request to go through:

Context + Out ≤ C_max

Context + Out ≤ TPM

  2. If Context > TPM, the request will never be executed – even after an hour.

Conclusion

If you keep adding context without trimming (no sliding window, no summarization), you will inevitably reach the point where:

Context > TPM

That is a bug in itself.
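Those two rules as a tiny self-contained check, with illustrative numbers (400k window, 30k TPM):

```python
C_MAX = 400_000  # model context window (per-request capacity)
TPM = 30_000     # Tier-1 tokens-per-minute cap (throughput)

def request_can_run(context_tokens: int, output_tokens: int) -> bool:
    fits_window = context_tokens + output_tokens <= C_MAX
    fits_minute = context_tokens + output_tokens <= TPM
    return fits_window and fits_minute

print(request_can_run(28_000, 13_000))  # False: 41k blows the 30k minute bucket
print(request_can_run(15_000, 10_000))  # True: fits both the window and the bucket
```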

AdsQnn avatar Aug 30 '25 19:08 AdsQnn

@AdsQnn If I am wrong, please criticize what I am saying :).

I reviewed the docs again in more detail, and I think I am wrong here regarding TPM and context relationship on the API and you are correct. My misunderstanding was in regards to where the context is stored when using Codex CLI in "API-mode". It is now my understanding (please correct me if I am wrong) that the context lives only on the local machine when Codex CLI is running in "API-mode", and the entire context must be sent to OpenAI's servers on every call. Talk about token burn!

This behavior seems somewhat unique to agentic coding agents operating on the CLI: Gemini CLI (via their API) and Codex CLI in "ChatGPT-User-Account-mode" use server-side context caching: the context is stored on OpenAI's servers (or Google's servers in the case of Gemini CLI), and only the prompt + completion tokens are sent. @AdsQnn, thanks for pointing me in the right direction here. TPM pressure rises with context size and this issue is created by the CLI’s resend-everything prompt construction (which is a necessity with the API’s stateless, token-bucket rate accounting).

The bottom line remains: Codex CLI is not working with GPT-5 for Tier 1 API accounts. The Codex CLI project is owned by OpenAI and advertised publicly by OpenAI in their API docs as working with GPT-5. However, due to TPM limits, GPT-5 is only usable in Codex CLI via the API for enterprise-level/higher-tier accounts. Once again, maintainers / OpenAI dev team (@ae-openai @pakrym-oai @codex-maintainers): please confirm ownership of this fix.

This issue should stay open until either the docs are changed to clarify that Codex is non-functional with tier-1 API accounts, or the TPM limits for tier 1 API are increased.

If I am wrong here and there is another fix, please correct me.

mdbro avatar Aug 31 '25 18:08 mdbro

@mdbro You’re thinking well, but in this case you’re overcomplicating it. Imagine you have a car (the context) that’s 3 meters wide, and a gate (the TPM) that’s only 2 meters wide. What happens when you try to drive the car through the gate? You won’t fit, just like a 140k context doesn’t fit through a 128k TPM.

So what would you need to do to drive through the gate freely, without forcing your way in? There are a few options: A) Widen the gate to 6 meters. B) Narrow the car down to 1 meter. C) Find a compromise: for example, make the gate 4 meters wide and the car 2 meters.

How does this apply to the situation above? In order for the context to pass smoothly through the TPM gate, the gate must always be wider than the context. Otherwise, sooner or later, you’ll get stuck.

At the moment, the maximum context is effectively 30k, because the gate won’t allow more. But imagine that for Tier 1 you widen the gate to 128k and narrow the maximum context to 60k. In that setup, you won’t get stuck at the gate; at worst, you’ll just have to wait a minute until it opens again, and then you can keep going.

And in this way, you reach a compromise without risking API overloads. However, to avoid hitting the “context too large” error again, if the limit is capped at 60k, there must be some kind of sliding window built into the CLI.

The context grows with each iteration, and even if you increase the TPM to gain some extra time — maybe squeeze in two more questions — eventually you’ll end up stuck, trying to push a 3-meter car through a 2-meter gate, and you simply won’t get through.

Otherwise, allowing a situation where the context can outgrow the TPM is a guaranteed bug and a fundamental design flaw — just like building a 2-meter gate for 3-meter cars.

At the moment, you’re trying to let a 400-meter car (400k context) pass through a 30-meter gate (30k TPM), and nothing is stopping it.

The solution is simple: you can’t allow the car to be wider than the gate — there always has to be some clearance. In other words, the TPM must be larger than the maximum allowed context, otherwise you’ll eventually get stuck.

Simply put, in the current situation, raising the limit will only let you ask a few extra questions before ending up in the same place — unless you raise the limit beyond 400k.

I’m not sure if this fully captures the idea, but that’s the gist. :)

AdsQnn avatar Aug 31 '25 18:08 AdsQnn

Correct, we are in agreement. Without server-side context caching (a persistent prefix/cache handle), the client must re-send history (context) every turn, so TPM and context usage couple after enough responses; eventually you "get stuck" if TPM is not >> the model's context window. Most large-model/ large-context coding CLIs avoid this by providing a server-side cache to reference. Codex API-key mode does not, which is why TPM is such an issue here.

But I think we are getting sidetracked. For this issue to be properly closed, either the Codex CLI docs need to change or the TPM restrictions need to change (unless anyone has an alternative solution to this issue).

mdbro avatar Sep 01 '25 04:09 mdbro

Same issue.. cannot use it via API atm

damtt avatar Sep 03 '25 08:09 damtt

I would like to note one thing though, since the discussion went into context caching and whatnot: in my case the issue was on the very first request, where the only context was the file provided to the model: "Change A to B in xxx.yy file".

Wydoyolo avatar Sep 03 '25 20:09 Wydoyolo

It is completely unusable at the moment. I downgraded to 0.20 and it changed nothing. stream disconnected before completion: Rate limit reached for gpt-5 in organization xxxxxxxx on tokens per min (TPM): Limit 30000, Used 22602, Requested 10429. Please try again in 6.062s. Visit https://platform.openai.com/account/rate-limits to learn more.

Innomatix46 avatar Sep 04 '25 04:09 Innomatix46