
thinking_budget=0 does not work

Open miroblog opened this issue 7 months ago • 19 comments

Description of the bug:

response.usage_metadata.thoughts_token_count

I get a non-zero value for the thought token count:

from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# prompt is defined elsewhere; len(prompt) is over 100k
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=[prompt],
    config=genai.types.GenerateContentConfig(
        thinking_config=genai.types.ThinkingConfig(thinking_budget=0)
    ),
)
print(
    f"Token usage: prompt={response.usage_metadata.prompt_token_count}, "
    f"candidates={response.usage_metadata.candidates_token_count}, "
    f"thoughts={response.usage_metadata.thoughts_token_count}"
)

I get: Token usage: prompt=100858, candidates=4076, thoughts=1515

Actual vs expected behavior:

Expected: Token usage: prompt=100858, candidates=4076, thoughts=0

Any other information you'd like to share?

This works for shorter prompts, but for longer prompts I get non-zero thought tokens even though I set the budget to 0.

miroblog avatar Apr 18 '25 04:04 miroblog

Hello @miroblog, I'm not able to reproduce the issue, even with context longer than yours. Can you share with me the long context you are using?

Giom-V avatar Apr 18 '25 15:04 Giom-V

A similar issue was raised here: https://discuss.ai.google.dev/t/gemini-2-5-flash-preview-04-17-not-honoring-thinking-budget-0/80165/3

miroblog avatar Apr 20 '25 13:04 miroblog

Hey @miroblog, I tried it with a larger context (around 200k), and now it's working as expected. The issue seems to be fixed.

Let me know if you're still facing the issue. Thanks.

Gunand3043 avatar Apr 22 '25 05:04 Gunand3043

This is not fixed for me; I find it's highly prompt-dependent, though. Some prompts will run fine with no thinking 10/10 times; others will use thinking tokens at least 50% of the time.

kyleholgate avatar Apr 22 '25 11:04 kyleholgate

@miroblog The team confirms there's a known issue where Gemini sometimes still thinks a bit even when told not to. I'll keep the thread up to date as I get updates.

Giom-V avatar Apr 22 '25 11:04 Giom-V

Is there any update on this issue?

X901 avatar Apr 25 '25 20:04 X901

I have the same issue. Is there an update for this yet? @Giom-V

vaidy12345 avatar Apr 28 '25 06:04 vaidy12345

I think this issue was fixed in the latest model update, gemini-2.5-flash-preview-05-20.

X901 avatar May 21 '25 15:05 X901

That's awesome, thank you so much :)

vaidy12345 avatar May 21 '25 15:05 vaidy12345

Issue seems to persist with the new model as well. Even though I set thinking_budget=0, the model still uses thinking tokens in most of my test cases.

kanzyai-emirarditi avatar May 21 '25 18:05 kanzyai-emirarditi

Just checking in to see if there are any updates? I’m still experiencing the issue even with thinking_budget=0. Would be great to know if a fix is in progress or if there’s a recommended workaround. Thanks!

cspiecker avatar Jun 26 '25 13:06 cspiecker

> Just checking in to see if there are any updates? I’m still experiencing the issue even with thinking_budget=0. Would be great to know if a fix is in progress or if there’s a recommended workaround. Thanks!

They released the final version a few days ago, check it out.

X901 avatar Jun 26 '25 13:06 X901

Working on the 2.5-flash final version. With reasoning_effort="none", there is still a small chance 2.5-flash will think, maybe caused by a custom thinking key in my result JSON structure. Now I'm testing thinking_budget.

usage=CompletionUsage(completion_tokens=88, prompt_tokens=1238, total_tokens=2176, completion_tokens_details=None, prompt_tokens_details=None)
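For reference, here's roughly what that reasoning_effort call looks like through Gemini's OpenAI-compatible endpoint; this is a sketch, and the client setup, API key, and prompt are placeholders, not my actual code:

# Sketch only: openai SDK pointed at Gemini's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="GEMINI_API_KEY",  # placeholder
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    reasoning_effort="none",  # intended to disable thinking entirely
    messages=[{"role": "user", "content": "Reply with a short JSON object."}],
)
# If total_tokens is much larger than prompt + completion tokens, the model
# probably spent the difference on (hidden) thinking.
print(response.usage)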

mammothrider avatar Jul 01 '25 09:07 mammothrider

> Working on the 2.5-flash final version. With reasoning_effort="none", there is still a small chance 2.5-flash will think, maybe caused by a custom thinking key in my result JSON structure. Now I'm testing thinking_budget.

> usage=CompletionUsage(completion_tokens=88, prompt_tokens=1238, total_tokens=2176, completion_tokens_details=None, prompt_tokens_details=None)

In the response, if you see thoughts with a non-zero number, it means the model was thinking; if thoughts=0, it was not thinking.

X901 avatar Jul 01 '25 09:07 X901

> > Working on the 2.5-flash final version. With reasoning_effort="none", there is still a small chance 2.5-flash will think, maybe caused by a custom thinking key in my result JSON structure. Now I'm testing thinking_budget.

> > usage=CompletionUsage(completion_tokens=88, prompt_tokens=1238, total_tokens=2176, completion_tokens_details=None, prompt_tokens_details=None)

> In the response, if you see thoughts with a non-zero number, it means the model was thinking; if thoughts=0, it was not thinking.

I don't see thinking or thoughts in my API call response, but some of my requests fail with a "Could not parse response content as the length limit was reached" error. I set max_tokens to 1024, and my results are normally around 100 tokens, yet all of these error logs show total_tokens above 2k. After some searching: total_tokens = prompt_tokens + output_tokens, and output_tokens = thinking_tokens + completion_tokens, so I can only guess that for some requests Gemini is doing some thinking.
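To make that guess concrete, here's the arithmetic on the CompletionUsage above (the output = thinking + completion split is an assumption; the usage object doesn't report thinking tokens directly):

# Estimate hidden thinking tokens from the CompletionUsage numbers above.
# Assumption: total = prompt + output, and output = thinking + completion.
prompt_tokens = 1238
completion_tokens = 88
total_tokens = 2176

implied_thinking = total_tokens - prompt_tokens - completion_tokens
print(implied_thinking)  # 850 tokens unaccounted for, presumably thinking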

Edit: I tested with thinking_budget=0, and the problem still exists.

extra_body = {
    "extra_body": {
        "google": {
            "thinking_config": {
                "thinking_budget": 0,
                "include_thoughts": True,  # for debug purposes
            }
        }
    }
}
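For completeness, this is roughly how such a config gets forwarded through the openai client (a sketch; the client setup and message are placeholders, and the nesting matches the snippet above):

# Sketch: forwarding the thinking_config above via the openai SDK's
# extra_body parameter, which is merged into the request body as-is.
from openai import OpenAI

client = OpenAI(
    api_key="GEMINI_API_KEY",  # placeholder
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[{"role": "user", "content": "Return the analysis as JSON."}],
    max_tokens=1024,
    extra_body=extra_body,  # the dict defined above
)
print(response.choices[0].message.content)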

About 10% of the requests will still think. One resulting completion after model_dump():

{
    "asctime": "2025-07-04 14:06:11",
    "severity": "DEBUG",
    "name": "services.openai_api",
    "module": "openai_api",
    "funcName": "get_completion_with_formatter",
    "lineno": 222,
    "correlation_id": "-",
    "message": "",
    "id": "",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": "<thought>***</thought>{\n  \"analysis\": \"***\",\n  \"user_output\": \"***\"\n}",
                "refusal": null,
                "role": "assistant",
                "audio": null,
                "function_call": null,
                "tool_calls": [

                ],
                "parsed": {
                    "analysis": "***",
                    "user_output": "***"
                },
                "extra_content": {
                    "google": {
                        "thought": true
                    }
                }
            }
        }
    ],
    "created": 1751609171,
    "model": "gemini-2.5-flash",
    "object": "chat.completion",
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "completion_tokens": 87,
        "prompt_tokens": 1238,
        "total_tokens": 2164,
        "completion_tokens_details": null,
        "prompt_tokens_details": null
    }
}

The extra_content part does not exist in my normal responses, and I wonder if this is some kind of service-level bug.

mammothrider avatar Jul 01 '25 09:07 mammothrider

This issue is occurring with gemini-2.5-flash-preview-09-2025. It does not respect the thinking_budget: setting the budget to 0 has no impact, and even setting it to a definite amount, e.g. 5k, results in more than 5k thinking tokens.

The problem only occurs when setting response_mime_type="application/json". But removing this is not an option, since we need structured output.
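For anyone trying to reproduce, a minimal sketch with the google-genai SDK (the prompt and client setup are placeholders; the model name is the one from this report):

from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-09-2025",
    contents=["List three colors as a JSON array."],  # placeholder prompt
    config=genai.types.GenerateContentConfig(
        response_mime_type="application/json",  # the structured output we need
        thinking_config=genai.types.ThinkingConfig(thinking_budget=0),
    ),
)
# With the bug, this prints a non-zero count despite thinking_budget=0.
print(response.usage_metadata.thoughts_token_count)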

stri8ed avatar Sep 25 '25 19:09 stri8ed

Agreed. It causes inconsistent outputs. gemini-2.5-flash-preview-09-2025 doesn't respect thinking_budget. In Google AI Studio, it works fine.

> This issue is occurring with gemini-2.5-flash-preview-09-2025. It does not respect the thinking_budget: setting the budget to 0 has no impact, and even setting it to a definite amount, e.g. 5k, results in more than 5k thinking tokens.

sirusbaladi avatar Nov 06 '25 01:11 sirusbaladi

Confirmed, seeing the same issue. Any update @Giom-V ?

marcwestermann avatar Nov 13 '25 18:11 marcwestermann

This is still being worked on, but we should have a solution soon, I hope.

Giom-V avatar Nov 13 '25 22:11 Giom-V