Bug Report: the model often starts creating repetitive sequences of tokens
Description of the bug:
Summary: When using the "gemini-1.5-flash" model to generate long texts, the model often starts producing repetitive sequences of tokens, effectively entering an infinite loop and exhausting the token limit. This issue is observed with both the Vertex AI and Gemini APIs.
Example: "The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed…"
Steps to Reproduce:
Use the "gemini-1.5-flash" model via the Vertex AI or Gemini API. Generate a long text (e.g., a legal or technical document). Observe the generated output for repetition of phrases or sentences.
Expected Behavior: The model should generate coherent, non-repetitive text.
Actual Behavior: The model begins to repeat sequences of tokens indefinitely, leading to the maximum token limit being reached.
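For reference, a minimal sketch of the kind of request that triggers this, using the google-generativeai Python SDK (the file name, prompt, and token limit are placeholders, not the exact code that produced the example above):
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-flash")

# A long input document (e.g. a legal text); the file name is a placeholder.
long_document = open("legal_document.txt", encoding="utf-8").read()

response = model.generate_content(
    "Summarize the following document in detail:\n\n" + long_document,
    generation_config=genai.GenerationConfig(max_output_tokens=8192),
)

# With long inputs, the output sometimes degenerates into the same sentence
# repeated over and over until max_output_tokens is exhausted.
print(response.text)
```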
Impact:
Wastes tokens and API usage limits. Generates unusable text, necessitating additional requests and costs.
Reproduction Rate: Occurs frequently with long text generation tasks.
Workaround: Currently, there is no known workaround to prevent this issue.
Request for Resolution:
Investigate the cause of the repetitive token generation. Implement a fix to prevent the model from entering a repetitive loop. Provide a mechanism for users to request refunds or credits for tokens wasted due to this bug.
Actual vs expected behavior:
Actual: “The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed…”
Expected: “The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. ”
Any other information you'd like to share?
No response
@rossanodr,
Thank you for reporting this issue. This repository is for issues related to the Gemini API Cookbook quickstarts and examples. For issues related to the Gemini API itself, we suggest using the "Send Feedback" option in the Gemini docs (see the screenshot below). You can also post this issue on the Google AI forum.
Thank you, but unfortunately I did not receive any response from either of them.
We are experiencing the same issue
I posted the same issue on the Gemini forum. It would be nice if you could make some noise there too, to bring attention to the problem: https://discuss.ai.google.dev/t/bug-report-the-model-often-starts-creating-repetitive-sequences-of-tokens/6445
@rossanodr Done.
Did you manage to make any progress on this?
No :( Unfortunately, I think the problem is with Gemini. It is happening with many different prompts. The main issue is the large context. Let's say your prompt is something like, "Read the document below and make a list of all the birthday dates in it: {list}". If the document is large, there is a chance the model starts repeating the same date until it reaches the token limit.
Marking this issue as stale since it has been open for 14 days with no activity. This issue will be closed if no further activity occurs.
I have too
I solved it like this (so you have to repeat yourself in order for the model not to repeat itself): Please don't return any tool_code in your response and follow the DRY principle (Don't repeat yourself). Please don't return any tool_code in your response and follow the DRY principle (Don't repeat yourself). Please don't return any tool_code in your response and follow the DRY principle (Don't repeat yourself). Please don't return any tool_code in your response and follow the DRY principle (Don't repeat yourself). Please don't return any tool_code in your response and follow the DRY principle (Don't repeat yourself).
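Roughly, the trick is just to append that instruction several times at the end of the prompt. A minimal sketch (the model setup and task prompt are illustrative, not my actual code):
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# The instruction wording is from the comment above; the task prompt is a placeholder.
anti_repeat = (
    "Please don't return any tool_code in your response and follow the DRY "
    "principle (Don't repeat yourself). "
)

prompt = "Summarize the attached contract in detail.\n\n" + anti_repeat * 5
response = model.generate_content(prompt)
print(response.text)
```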
There are some improvements, but the repetition issue still occurs. It seems like this is an unavoidable bug. I hope other LLMs don't have this problem.
@zxl777 Fixed it.
Basically, I asked Gemini to rephrase my prompt, and I moved the prompt from the system instruction to the actual chat/feed. Removing the data structure (and just explaining it in the prompt) also improved performance.
I also turned temperature and top-p way down to 0.
Everything helped to a certain extent, but leaving the system prompt empty helped much more.
I basically realized that the system prompt is good for restricting the model or giving it guidelines (ethics, etc.), but the model is not good at following instructions placed there.
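A minimal sketch of that setup (the model name, prompt, and file name are placeholders):
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# No system_instruction: the task description goes into the user turn instead.
model = genai.GenerativeModel("gemini-1.5-flash")

document_text = open("input_document.txt", encoding="utf-8").read()

response = model.generate_content(
    "Read the document below and list every birthday date mentioned in it, "
    "as plain prose rather than a fixed data structure:\n\n" + document_text,
    generation_config=genai.GenerationConfig(temperature=0.0, top_p=0.0),
)
print(response.text)
```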
not working for me. same errors
Having the same issue using gemini-1.5-flash-8b-001
Having the same issue using gemini-1.5-flash-002
same here
Having the same issue using gemini-1.5-flash-001 or gemini-1.5-flash-002
Having the same issue here gemini-1.5-flash-002
same with gemini pro
Marking this issue as stale since it has been open for 14 days with no activity. This issue will be closed if no further activity occurs.
same issue
One thing you can try is increasing the temperature, as a temperature of 0 increases the chances of the model looping.
Same Issue gemini-1.5-pro-001
We had this happen in a few-shot document classification/entity extraction use case. We were able to fix it by downgrading google-cloud-aiplatform:
pip install "google-cloud-aiplatform==1.69.0" --force-reinstall
We can reproduce the error reliably in sandboxed environments by switching the package versions back and forth: from looping text output until the token limit is reached (latest version) to straightforward, correct output (version 1.69.0). Sadly, the few-shot documents used to create this problem are proprietary and I can't share them. I really can't imagine what the package is doing to influence the model outputs this badly, but that's what we're working with...
Maybe this solves the issue for anyone else who has this problem!
Quick update: the problem consistently appears with "google-cloud-aiplatform==1.69.0" as well. So either I got extremely lucky in all my previous experiments, or they made some changes in the backend that break our initial downgrade fix.
I'm having the same issue with gemini-2.0-flash.
Also doing data extraction. I have seen the issue on input sizes from 5K to 100K tokens. The issue is inconsistent at temp=0: most of the time it works as expected, and then very sporadically it breaks and repeats itself indefinitely.
I'm using structured output with Pydantic classes.
I've managed to reduce errors by adding lines to my system prompt like VERY IMPORTANT: DO NOT ENDLESSLY REPEAT YOURSELF. More testing on my end needs to be done to see how well this has solved the issue.
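Roughly, the setup looks like this, using the google-genai SDK (the schema, prompt, and system line are simplified placeholders, not our actual extraction code):
```python
from google import genai
from pydantic import BaseModel

# Hypothetical stand-in for our real Pydantic response schema.
class ExtractedFact(BaseModel):
    label: str
    value: str

client = genai.Client(api_key="YOUR_API_KEY")

document_text = open("input_document.txt", encoding="utf-8").read()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Extract everything to do with X from the document below.\n\n" + document_text,
    config=genai.types.GenerateContentConfig(
        temperature=0.0,
        response_mime_type="application/json",
        response_schema=list[ExtractedFact],
        system_instruction="VERY IMPORTANT: DO NOT ENDLESSLY REPEAT YOURSELF.",
    ),
)
print(response.parsed)  # a list of ExtractedFact objects when the schema is honoured
```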
Has anyone found a more comprehensive fix for this and/or is it on Google's radar?
It is on our radar, but so far the only advice I can give you is to give clear instructions to the model (like telling it not to answer more than X items, as we do in the spatial understanding guide), or to use a non-zero temperature (as we also do in that guide).
Thanks Giom!
@osheaoshea has changing the temperature to slightly above 0 and adding the "very important" line to your system prompts worked, or at least lowered the number of occurrences? :)
Not really. We're still seeing this spamming error roughly 10-15% of the time on our data extraction prompts.
Non-zero temperature and constant tweaks to our system prompts didn't help; however, it's quite difficult to test changes as the error is quite sporadic and inconsistent. Unfortunately, for our use case, it's difficult to include lines like "tell it to not answer more than X items" (as suggested by @Giom-V) in our system prompt. I suspect the nature of our open-ended prompts (e.g. 'extract everything to do with X') is not helping the errors.
We've opted to use a fallback model to redo the Gemini request if it fails in this way. This has increased our costs and slowed down our processing time but is reliable in production. It's on my list to experiment more with the way we prompt Gemini to see if I can reduce our errors - I'll update here if I find something useful. Interested if you've found anything that works for this particular use case.
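For reference, the fallback looks roughly like this; the repetition check and the call_* wrappers are simplified stand-ins, not our production code:
```python
def looks_repetitive(text: str, window: int = 200, min_repeats: int = 3) -> bool:
    """Crude heuristic: the tail of the output appears many times earlier in the text."""
    tail = text[-window:]
    return len(tail) == window and text.count(tail) >= min_repeats

def extract(prompt: str) -> str:
    primary = call_gemini(prompt)          # hypothetical wrapper around the Gemini request
    if not looks_repetitive(primary):
        return primary
    return call_fallback_model(prompt)     # hypothetical wrapper around the fallback model
```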
