[BUG] Vertex AI 413 - can't count tokens
### Environment
- Platform (select one):
  - [ ] Anthropic API
  - [ ] AWS Bedrock
  - [x] Google Vertex AI
  - [ ] Other:
- Claude CLI version: 1.0.33
- Operating System: macOS 15.5
- Terminal: iTerm2
### Bug Description
When using Claude Code through Vertex AI, requests intermittently fail with a 413 "Prompt is too long" error. The count-tokens:rawPredict endpoint returns a 400 error, which prevents accurate token counting and presumably leads to the 413.
The issue appears to be with this field:

```json
"cache_control": {
  "type": "ephemeral"
}
```
If I make the same call without cache_control, I receive a 200 OK, and the response shows:

```json
{
  "input_tokens": 252461
}
```

So that's the reason for the 413.
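For anyone who wants to reproduce this outside Claude Code, here is a minimal sketch of the two count-tokens calls. The endpoint path is taken from the error above; the project, region, model id, and anthropic_version value are placeholders/assumptions, not something pulled from Claude Code's internals.

```python
# Hypothetical reproduction sketch (not Claude Code's own code): compare the
# count-tokens:rawPredict response with and without cache_control.
# PROJECT_ID, REGION, the model id, and the anthropic_version value are
# placeholders/assumptions.
import copy
import requests
import google.auth
import google.auth.transport.requests

PROJECT_ID = "my-project"  # placeholder
REGION = "us-east5"        # placeholder

creds, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(google.auth.transport.requests.Request())

url = (
    f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
    f"/locations/{REGION}/publishers/anthropic/models/count-tokens:rawPredict"
)

body = {
    "anthropic_version": "vertex-2023-10-16",  # assumed version string
    "model": "claude-sonnet-4",                # placeholder model id
    "messages": [{
        "role": "user",
        "content": [{
            "type": "text",
            "text": "large codebase snippet ...",
            "cache_control": {"type": "ephemeral"},  # field that appears to trigger the 400
        }],
    }],
}

headers = {"Authorization": f"Bearer {creds.token}"}

# With cache_control: observed 400 in this report.
print("with cache_control:", requests.post(url, json=body, headers=headers).status_code)

# Without cache_control: observed 200 with an input_tokens count.
body2 = copy.deepcopy(body)
del body2["messages"][0]["content"][0]["cache_control"]
resp = requests.post(url, json=body2, headers=headers)
print("without cache_control:", resp.status_code, resp.text)
```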
### Steps to Reproduce
- Analyze a big codebase with Claude Code on Vertex AI
- Claude Code sends a request with >200k tokens and it fails with a 413
### Expected Behavior
Claude Code should chunk the work into smaller requests instead of failing.
### Additional Context
I am seeing the same thing with Vertex:
```
API Error: 413 {"error":{"message":"{\"type\":\"error\",\"error\":{\"type\":\"invalid_request_error\",\"message\":\"Prompt is too long\"}}","type":"None","param":"None","code":"413"}
```
Even though Claude Code reports: `Context left until auto-compact: 2%`
At this point `/compact` does not work (same 413 error); the only way to get out of this state is to restart or `/clear`.
Looking at the payload, CC seems to be polluting the context with a large number of tools and codebase snippets. This mostly happens when analyzing a large, multi-megabyte piece of code in a single file.
We're seeing this issue as well.
I suppose Vertex has a limit on the count-tokens endpoint that the Anthropic API does not.
Claude Code team, this is a blocker for our usage in a large organization. Can you have the tool chunk the count-tokens requests, or take another approach?
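To make the chunking suggestion concrete, here is a rough sketch of what chunked token counting could look like. It assumes the AnthropicVertex client exposes `messages.count_tokens` the same way the standard Anthropic client does, and that summing per-batch counts is an acceptable approximation; region, project, model, and batch size are placeholders.

```python
# Hypothetical illustration of the chunking request above (NOT how Claude Code
# currently behaves): count tokens in batches so no single count-tokens call
# carries the whole conversation, then sum the per-batch counts.
from anthropic import AnthropicVertex

client = AnthropicVertex(region="us-east5", project_id="my-project")  # placeholders

MAX_MESSAGES_PER_BATCH = 20  # assumed batch size


def count_tokens_chunked(messages, model="claude-sonnet-4"):
    """Approximate the total token count by summing per-batch counts.

    A real implementation would also need to respect the API's
    user/assistant role-alternation rules when slicing batches.
    """
    total = 0
    for i in range(0, len(messages), MAX_MESSAGES_PER_BATCH):
        batch = messages[i:i + MAX_MESSAGES_PER_BATCH]
        result = client.messages.count_tokens(model=model, messages=batch)
        total += result.input_tokens
    return total
```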
It also feels like when CC gets a 413, there should be some sort of heuristic that throws less important/older pieces out of the context and retries, allowing it to recover gracefully from this situation rather than having to drop all the context (see the sketch after the next paragraph).
I suspect there is more than one scenario where the upstream API tells you to go away due to a (perceived) token limit, with all these non-Anthropic-native hosts, enterprise LLM proxies, etc. out there.
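Purely to illustrate that idea (this is not an existing Claude Code feature), a recovery loop could shed the oldest turns and retry when the upstream API returns a 413. All names and numbers below are illustrative:

```python
# Hypothetical sketch of the graceful-recovery idea above: on an HTTP 413 from
# the upstream API, shed the oldest turns and retry instead of forcing /clear.
# This is not Claude Code's actual logic; all names and numbers are illustrative.
from anthropic import AnthropicVertex, APIStatusError

client = AnthropicVertex(region="us-east5", project_id="my-project")  # placeholders

DROP_PER_RETRY = 4  # assumed: shed this many of the oldest messages per attempt
MAX_RETRIES = 5


def create_with_context_shedding(messages, model="claude-sonnet-4", **kwargs):
    """Retry messages.create, trimming the oldest turns whenever a 413 comes back.

    The caller supplies the usual arguments (max_tokens, system, etc.) via kwargs.
    A real implementation would preserve user/assistant alternation and keep
    pinned or important context instead of trimming blindly.
    """
    trimmed = list(messages)
    for _ in range(MAX_RETRIES):
        try:
            return client.messages.create(model=model, messages=trimmed, **kwargs)
        except APIStatusError as err:
            if err.status_code != 413 or len(trimmed) <= DROP_PER_RETRY:
                raise
            trimmed = trimmed[DROP_PER_RETRY:]  # drop the oldest turns and retry
    raise RuntimeError("still over the prompt limit after trimming context")
```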
I've been seeing this issue in my setup recently too. Is there some kind of workaround?
This issue has been inactive for 30 days. If the issue is still occurring, please comment to let us know. Otherwise, this issue will be automatically closed in 30 days for housekeeping purposes.
Can't confirm because we switched to the Anthropic API, but I didn't see anything related to this in the changelog.