thinking_budget=0 does not work
Description of the bug:
Even with thinking_budget=0, I get a non-zero response.usage_metadata.thoughts_token_count:
from google import genai  # assumed setup, not shown in the original snippet

client = genai.Client()

# len(prompt) is over 100k characters here
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=[prompt],
    config=genai.types.GenerateContentConfig(
        thinking_config=genai.types.ThinkingConfig(thinking_budget=0)
    ),
)
print(
    f"Token usage: prompt={response.usage_metadata.prompt_token_count}, "
    f"candidates={response.usage_metadata.candidates_token_count}, "
    f"thoughts={response.usage_metadata.thoughts_token_count}"
)
I get Token usage: prompt=100858, candidates=4076, thoughts=1515
Actual vs expected behavior:
Expected: Token usage: prompt=100858, candidates=4076, thoughts=0
Any other information you'd like to share?
This works for shorter prompts, but for longer prompts I get non-zero thought tokens even though I set the budget to 0.
Hello @miroblog, I'm not able to reproduce the issue, even with context longer than yours. Can you share with me the long context you are using?
A similar issue was raised here: https://discuss.ai.google.dev/t/gemini-2-5-flash-preview-04-17-not-honoring-thinking-budget-0/80165/3
Hey @miroblog , I tried it with a larger context (around 200k), and now it's working as expected. The issue seems to be fixed.
Let me know if you're still facing the issue. Thanks.
This is not fixed for me, though I find it's highly prompt-dependent. Some prompts will run fine with no thinking 10/10 times; others will use thinking tokens at least 50% of the time.
@miroblog The team confirms there's a known issue where Gemini sometimes still thinks a bit even when told not to. I'll keep the thread up to date when I get more information.
Is there any update on this issue?
I have the same issue. Is there an update for this yet? @Giom-V
I think this issue was fixed in the latest model update, gemini-2.5-flash-preview-05-20.
That's awesome, thank you so much :)
Issue seems to persist with the new model as well. Even though I set thinking_budget=0, the model still uses thinking tokens in most of my test cases.
Just checking in to see if there are any updates? I’m still experiencing the issue even with thinking_budget=0. Would be great to know if a fix is in progress or if there’s a recommended workaround. Thanks!
They released the final version a few days ago; check it out.
Working on the 2.5-flash final version. With reasoning_effort="none", there is still a small chance 2.5 Flash will think, maybe caused by a custom thinking key in my result JSON structure. Now I'm testing thinking_budget.
usage=CompletionUsage(completion_tokens=88, prompt_tokens=1238, total_tokens=2176, completion_tokens_details=None, prompt_tokens_details=None)
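For reference, here is a minimal sketch of the kind of call described above, assuming the Gemini OpenAI-compatibility endpoint; the model name, prompt, and environment variable are placeholders rather than details from the original comment.

import os

from openai import OpenAI

# Assumed setup: the OpenAI SDK pointed at the Gemini OpenAI-compatibility endpoint.
client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",   # placeholder model name
    reasoning_effort="none",    # the setting the commenter reports as mostly working
    messages=[{"role": "user", "content": "Summarize this in one sentence: ..."}],
)

# Compare completion_tokens against total_tokens - prompt_tokens to spot hidden thinking.
print(response.usage)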
In the response, if you see thoughts with a non-zero number, it means the model was thinking; if it shows thoughts=0, it was not thinking.
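As a concrete illustration, a minimal sketch of checking that counter with the google-genai SDK; the model name and prompt are placeholders.

from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",  # placeholder
    contents="Hello",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)

# A non-zero (or non-None) thoughts_token_count means the model thought despite budget 0.
thoughts = response.usage_metadata.thoughts_token_count
print("thinking detected" if thoughts else "no thinking tokens reported")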
I don't see thinking or thoughts anywhere in my API call response, but some of my requests fail with a "Could not parse response content as the length limit was reached" error. I set max_tokens to 1024, and my results are normally around 100 tokens, yet all of these error logs show total_tokens of more than 2k. From what I've found, total_tokens = prompt_tokens + output_tokens, and output_tokens = thinking_tokens + completion_tokens, so I can only guess that for some requests Gemini is doing some thinking.
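A rough sketch of that arithmetic, using the usage numbers reported earlier in the thread; the thinking/completion split is an inference, not a field returned by this endpoint.

# Usage values from the CompletionUsage reported above.
prompt_tokens = 1238
completion_tokens = 88
total_tokens = 2176

# total_tokens = prompt_tokens + output_tokens
output_tokens = total_tokens - prompt_tokens                  # 938

# output_tokens = thinking_tokens + completion_tokens (inferred)
inferred_thinking_tokens = output_tokens - completion_tokens  # 850

# A positive value suggests the model thought despite thinking_budget=0.
print(inferred_thinking_tokens)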
Edit: I tested with thinking_budget=0, and the problem still exists. My config:
extra_body = {
    "extra_body": {
        "google": {
            "thinking_config": {
                "thinking_budget": 0,
                "include_thoughts": True,  # for debug purposes
            }
        }
    }
}
and about 10% of the requests will still think. One completion result after model_dump():
{
    "asctime": "2025-07-04 14:06:11",
    "severity": "DEBUG",
    "name": "services.openai_api",
    "module": "openai_api",
    "funcName": "get_completion_with_formatter",
    "lineno": 222,
    "correlation_id": "-",
    "message": "",
    "id": "",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "<thought>***</thought>{\n \"analysis\": \"***\",\n \"user_output\": \"***\"\n}",
                "refusal": null,
                "role": "assistant",
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "parsed": {
                    "analysis": "***",
                    "user_output": "***"
                },
                "extra_content": {
                    "google": {
                        "thought": true
                    }
                }
            }
        }
    ],
    "created": 1751609171,
    "model": "gemini-2.5-flash",
    "object": "chat.completion",
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "completion_tokens": 87,
        "prompt_tokens": 1238,
        "total_tokens": 2164,
        "completion_tokens_details": null,
        "prompt_tokens_details": null
    }
}
The extra_content part does not appear in my normal requests, so I'm wondering whether this is some kind of service-level bug.
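For context, a minimal sketch of how an extra_body like the one above is typically passed through the Gemini OpenAI-compatibility endpoint; the model name, prompt, and environment variable are assumptions for illustration.

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

extra_body = {
    "extra_body": {
        "google": {
            "thinking_config": {
                "thinking_budget": 0,
                "include_thoughts": True,  # for debugging: surfaces thought parts in the response
            }
        }
    }
}

response = client.chat.completions.create(
    model="gemini-2.5-flash",  # placeholder
    messages=[{"role": "user", "content": "Return a short JSON summary of ..."}],
    extra_body=extra_body,
)
print(response.usage)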
This issue is occurring with gemini-2.5-flash-preview-09-2025. It does not respect the thinking_budget: setting the budget to 0 has no impact, and even setting it to a definite amount, e.g. 5k, results in more than 5k thinking tokens.
The problem only occurs when setting response_mime_type="application/json", but removing this is not an option, since we need structured output.
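For reproduction purposes, a minimal sketch of the combination described above using the google-genai SDK; the prompt and JSON shape are placeholders, not the reporter's actual workload.

from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-09-2025",
    contents="Summarize the following text as {\"summary\": \"...\"}: ...",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",  # structured output, which seems to trigger the issue
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)

# Reportedly non-zero in this configuration even though the budget is 0.
print(response.usage_metadata.thoughts_token_count)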
Agreed, it causes inconsistent outputs. gemini-2.5-flash-preview-09-2025 doesn't respect thinking_budget. In Google AI Studio, it works fine.
Confirmed, seeing the same issue. Any update @Giom-V ?
This is still being worked on, but I hope we'll have a solution soon.