[BUG] Vertex AI 413 - can't count tokens
### Environment
- Platform (select one):
  - [ ] Anthropic API
  - [ ] AWS Bedrock
  - [x] Google Vertex AI
  - [ ] Other:
- Claude CLI version: 1.0.33
- Operating System: macOS 15.5
- Terminal: iTerm2
### Bug Description
When using Claude Code through Vertex AI, requests intermittently fail with a 413 "Prompt is too long" error. The count-tokens:rawPredict endpoint returns a 400 error, which prevents accurate token counting and presumably leads to the 413.
The issue appears to be with this field:

```json
"cache_control": {
  "type": "ephemeral"
}
```
If I make the same call without cache_control, I receive a 200 OK, and the response shows:

```json
{
  "input_tokens": 252461
}
```

So that's the reason for the 413.
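For anyone who wants to reproduce this outside Claude Code, here is a minimal sketch of the two count-tokens calls. The endpoint path is taken from the error above; the project, region, model id, and anthropic_version value are placeholders/assumptions, not something pulled from Claude Code's internals.

```python
# Hypothetical reproduction sketch (not Claude Code's own code): compare the
# count-tokens:rawPredict response with and without cache_control.
# PROJECT_ID, REGION, the model id, and the anthropic_version value are
# placeholders/assumptions.
import copy
import requests
import google.auth
import google.auth.transport.requests

PROJECT_ID = "my-project"  # placeholder
REGION = "us-east5"        # placeholder

creds, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(google.auth.transport.requests.Request())

url = (
    f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
    f"/locations/{REGION}/publishers/anthropic/models/count-tokens:rawPredict"
)

body = {
    "anthropic_version": "vertex-2023-10-16",  # assumed version string
    "model": "claude-sonnet-4",                # placeholder model id
    "messages": [{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "large codebase snippet ...",
            "cache_control": {"type": "ephemeral"},  # field that appears to trigger the 400
        }],
    }],
}

headers = {"Authorization": f"Bearer {creds.token}"}

# With cache_control: observed 400 in this report.
print("with cache_control:", requests.post(url, json=body, headers=headers).status_code)

# Without cache_control: observed 200 with an input_tokens count.
body2 = copy.deepcopy(body)
del body2["messages"][0]["content"][0]["cache_control"]
resp = requests.post(url, json=body2, headers=headers)
print("without cache_control:", resp.status_code, resp.text)
```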
### Steps to Reproduce
- Analyze a big codebase with Claude Code on Vertex AI
- Claude Code sends a request with >200k tokens and it fails with a 413
### Expected Behavior
Claude Code should chunk the work into smaller requests instead of failing.
### Additional Context
I am seeing the same thing with Vertex:
```
API Error: 413 {"error":{"message":"{\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"message\":\"Prompt is too long\"}}","type":"None","param":"None","code":"413"}
```
Even though Claude Code reports: `Context left until auto-compact: 2%`
At this point `/compact` does not work (same 413 error); the only way to get out of this state is to restart or `/clear`.
Looking at the payload, CC seems to be polluting the context with a large number of tools and codebase snippets. This mostly happens when analyzing a large, multi-megabyte piece of code in a single file.
We're seeing this issue as well.
I suppose Vertex has a limit on the count-tokens endpoint that the Anthropic API does not.
Claude Code team, this is a blocker for our usage in a large organization. Can you have the tool chunk the count-tokens requests, or take another approach?
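To make the chunking suggestion concrete, here is a rough sketch of what chunked token counting could look like. It assumes the AnthropicVertex client exposes `messages.count_tokens` the same way the standard Anthropic client does, and that summing per-batch counts is an acceptable approximation; region, project, model, and batch size are placeholders.

```python
# Hypothetical illustration of the chunking request above (NOT how Claude Code
# currently behaves): count tokens in batches so no single count-tokens call
# carries the whole conversation, then sum the per-batch counts.
from anthropic import AnthropicVertex

client = AnthropicVertex(region="us-east5", project_id="my-project")  # placeholders

MAX_MESSAGES_PER_BATCH = 20  # assumed batch size


def count_tokens_chunked(messages, model="claude-sonnet-4"):
    """Approximate the total token count by summing per-batch counts.

    A real implementation would also need to respect the API's
    user/assistant role-alternation rules when slicing batches.
    """
    total = 0
    for i in range(0, len(messages), MAX_MESSAGES_PER_BATCH):
        batch = messages[i:i + MAX_MESSAGES_PER_BATCH]
        result = client.messages.count_tokens(model=model, messages=batch)
        total += result.input_tokens
    return total
```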
It also feels like when CC gets a 413, there should be some sort of heuristic that throws less important/older pieces out of the context and retries, allowing it to recover gracefully from this situation rather than having to drop all the context (see the sketch after the next paragraph).
I suspect there is more than one scenario where the upstream API tells you to go away due to a (perceived) token limit, with all these non-Anthropic-native hosts, enterprise LLM proxies, etc. out there.
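Purely to illustrate that idea (this is not an existing Claude Code feature), a recovery loop could shed the oldest turns and retry when the upstream API returns a 413. All names and numbers below are illustrative:

```python
# Hypothetical sketch of the graceful-recovery idea above: on an HTTP 413 from
# the upstream API, shed the oldest turns and retry instead of forcing /clear.
# This is not Claude Code's actual logic; all names and numbers are illustrative.
from anthropic import AnthropicVertex, APIStatusError

client = AnthropicVertex(region="us-east5", project_id="my-project")  # placeholders

DROP_PER_RETRY = 4  # assumed: shed this many of the oldest messages per attempt
MAX_RETRIES = 5


def create_with_context_shedding(messages, model="claude-sonnet-4", **kwargs):
    """Retry messages.create, trimming the oldest turns whenever a 413 comes back.

    The caller supplies the usual arguments (max_tokens, system, etc.) via kwargs.
    A real implementation would preserve user/assistant alternation and keep
    pinned or important context instead of trimming blindly.
    """
    trimmed = list(messages)
    for _ in range(MAX_RETRIES):
        try:
            return client.messages.create(model=model, messages=trimmed, **kwargs)
        except APIStatusError as err:
            if err.status_code != 413 or len(trimmed) <= DROP_PER_RETRY:
                raise
            trimmed = trimmed[DROP_PER_RETRY:]  # drop the oldest turns and retry
    raise RuntimeError("still over the prompt limit after trimming context")
```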
I've been seeing this issue in my setup recently too. Is there some kind of workaround?
This issue has been inactive for 30 days. If the issue is still occurring, please comment to let us know. Otherwise, this issue will be automatically closed in 30 days for housekeeping purposes.
Can't confirm because we switched to the Anthropic API, but I didn't see anything related to this in the changelog.