Bug Report: the model often starts creating repetitive sequences of tokens
Description of the bug:
Summary: When using the "gemini-1.5-flash" model to generate long texts, the model often starts producing repetitive sequences of tokens, effectively entering an infinite loop and exhausting the token limit. This issue is observed with both the Vertex AI and Gemini APIs.
Example: "The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed…"
Steps to Reproduce:
Use the "gemini-1.5-flash" model via the Vertex AI or Gemini API. Generate a long text (e.g., a legal or technical document). Observe the generated output for repetition of phrases or sentences.
Expected Behavior: The model should generate coherent, non-repetitive text.
Actual Behavior: The model begins to repeat sequences of tokens indefinitely, leading to the maximum token limit being reached.
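For reference, a minimal sketch of the kind of request that triggers this, using the google-generativeai Python SDK (the file name, prompt, and token limit are placeholders, not the exact code that produced the example above):
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-flash")

# A long input document (e.g. a legal text); the file name is a placeholder.
long_document = open("legal_document.txt", encoding="utf-8").read()

response = model.generate_content(
    "Summarize the following document in detail:\n\n" + long_document,
    generation_config=genai.GenerationConfig(max_output_tokens=8192),
)

# With long inputs, the output sometimes degenerates into the same sentence
# repeated over and over until max_output_tokens is exhausted.
print(response.text)
```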
Impact:
Wastes tokens and API usage limits. Generates unusable text, necessitating additional requests and costs.
Reproduction Rate: Occurs frequently with long text generation tasks.
Workaround: Currently, there is no known workaround to prevent this issue.
Request for Resolution:
Investigate the cause of the repetitive token generation. Implement a fix to prevent the model from entering a repetitive loop. Provide a mechanism for users to request refunds or credits for tokens wasted due to this bug.
Actual vs expected behavior:
Actual: “The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. The judgment can be appealed…”
Expected: “The judgment can be appealed in a motion for reconsideration, claiming that the judge did not consider the evidence properly. ”
Any other information you'd like to share?
No response
@rossanodr,
Thank you for reporting this issue. This repository is for issues related to the Gemini API Cookbook quickstarts and examples. For issues related to the Gemini API itself, we suggest using the "Send Feedback" option in the Gemini docs (see the screenshot below). You can also post this issue on the Google AI forum.
Thank you, but unfortunately I did not receive any response from either of them.
We are experiencing the same issue
I posted the same issue on the Gemini forum. It would be nice if you could make some noise there too, to bring attention to the problem: https://discuss.ai.google.dev/t/bug-report-the-model-often-starts-creating-repetitive-sequences-of-tokens/6445
@rossanodr Done.
Did you manage to make any progress on this?
No :( Unfortunately, I think the problem is with Gemini. It is happening with many different prompts. The main issue is the large context. Let's say your prompt is something like, "Read the document below and make a list of all the birthday dates in it: {list}". If the document is large, there is a chance the model starts repeating the same date until it reaches the token limit.
Marking this issue as stale since it has been open for 14 days with no activity. This issue will be closed if no further activity occurs.
I have too
I solved it like this (so you have to repeat yourself in order for the model not to repeat itself): Please don't return any tool_code in your response and follow the DRY principle (Don't repeat yourself). Please don't return any tool_code in your response and follow the DRY principle (Don't repeat yourself). Please don't return any tool_code in your response and follow the DRY principle (Don't repeat yourself). Please don't return any tool_code in your response and follow the DRY principle (Don't repeat yourself). Please don't return any tool_code in your response and follow the DRY principle (Don't repeat yourself).
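Roughly, the trick is just to append that instruction several times at the end of the prompt. A minimal sketch (the model setup and task prompt are illustrative, not my actual code):
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# The instruction wording is from the comment above; the task prompt is a placeholder.
anti_repeat = (
    "Please don't return any tool_code in your response and follow the DRY "
    "principle (Don't repeat yourself). "
)

prompt = "Summarize the attached contract in detail.\n\n" + anti_repeat * 5
response = model.generate_content(prompt)
print(response.text)
```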
There are some improvements, but the repetition issue still occurs. It seems like this is an unavoidable bug. I hope other LLMs don't have this problem.
@zxl777 Fixed it.
Basically, I asked Gemini to rephrase my prompt, and I moved the prompt from the system instruction to the actual chat/feed. Removing the data structure (and just explaining it in the prompt) also improved performance.
I also turned temperature and top-p way down to 0.
Everything helped to a certain extent, but leaving the system prompt empty helped much more.
I basically realized that the system prompt is good for restricting the model or giving it guidelines (ethics, etc.), but the model is not good at following instructions placed there.
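A minimal sketch of that setup (the model name, prompt, and file name are placeholders):
```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# No system_instruction: the task description goes into the user turn instead.
model = genai.GenerativeModel("gemini-1.5-flash")

document_text = open("input_document.txt", encoding="utf-8").read()

response = model.generate_content(
    "Read the document below and list every birthday date mentioned in it, "
    "as plain prose rather than a fixed data structure:\n\n" + document_text,
    generation_config=genai.GenerationConfig(temperature=0.0, top_p=0.0),
)
print(response.text)
```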
not working for me. same errors
Having the same issue using gemini-1.5-flash-8b-001
Having the same issue using gemini-1.5-flash-002
same here
Having the same issue using gemini-1.5-flash-001 or gemini-1.5-flash-002
Having the same issue here gemini-1.5-flash-002
same with gemini pro
Marking this issue as stale since it has been open for 14 days with no activity. This issue will be closed if no further activity occurs.
same issue
One thing you can try is increasing the temperature, as a temperature of 0 increases the chances of the model looping.
Same Issue gemini-1.5-pro-001
We had this happen in a few-shot document classification/entity extraction use case. We were able to fix it by downgrading google-cloud-aiplatform:
pip install "google-cloud-aiplatform==1.69.0" --force-reinstall
We can reproduce the error reliably in sandboxed environments by switching the package versions back and forth: from looping text output until the token limit is reached (latest version) to straightforward, correct output (version 1.69.0). Sadly, the few-shot documents used to create this problem are proprietary and I can't share them. I really can't imagine what the package is doing to influence the model outputs this badly, but that's what we're working with...
Maybe this solves the issue for anyone else who has this problem!
Quick update: the problem consistently appears with "google-cloud-aiplatform==1.69.0" as well. So either I got extremely lucky in all my previous experiments, or they made some changes in the backend that break our initial downgrade fix.
I'm having the same issue with gemini-2.0-flash.
Also doing data extraction. I have seen the issue on input sizes from 5K to 100K tokens. The issue is inconsistent at temp=0: most of the time it works as expected, and then very sporadically it breaks and repeats itself indefinitely.
I'm using structured output with Pydantic classes.
I've managed to reduce errors by adding lines to my system prompt like VERY IMPORTANT: DO NOT ENDLESSLY REPEAT YOURSELF. More testing on my end needs to be done to see how well this has solved the issue.
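Roughly, the setup looks like this, using the google-genai SDK (the schema, prompt, and system line are simplified placeholders, not our actual extraction code):
```python
from google import genai
from pydantic import BaseModel

# Hypothetical stand-in for our real Pydantic response schema.
class ExtractedFact(BaseModel):
    label: str
    value: str

client = genai.Client(api_key="YOUR_API_KEY")

document_text = open("input_document.txt", encoding="utf-8").read()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Extract everything to do with X from the document below.\n\n" + document_text,
    config=genai.types.GenerateContentConfig(
        temperature=0.0,
        response_mime_type="application/json",
        response_schema=list[ExtractedFact],
        system_instruction="VERY IMPORTANT: DO NOT ENDLESSLY REPEAT YOURSELF.",
    ),
)
print(response.parsed)  # a list of ExtractedFact objects when the schema is honoured
```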
Has anyone found a more comprehensive fix for this and/or is it on Google's radar?
It is on our radar, but so far the only advice I can give you is to give clear instructions to the model (like telling it not to answer more than X items, as we do in the spatial understanding guide), or to use a non-zero temperature (as we also do in that guide).
Thanks Giom!
@osheaoshea has changing the temperature to slightly above 0 and adding the "very important" line to your system prompts worked, or at least lowered the number of occurrences? :)
Not really. We're still seeing this spamming error roughly 10-15% of the time on our data extraction prompts.
Non-zero temperature and constant tweaks to our system prompts didn't help; however, it's quite difficult to test changes as the error is quite sporadic and inconsistent. Unfortunately, for our use case, it's difficult to include lines like "tell it to not answer more than X items" (as suggested by @Giom-V) in our system prompt. I suspect the nature of our open-ended prompts (e.g. 'extract everything to do with X') is not helping the errors.
We've opted to use a fallback model to redo the Gemini request if it fails in this way. This has increased our costs and slowed down our processing time but is reliable in production. It's on my list to experiment more with the way we prompt Gemini to see if I can reduce our errors - I'll update here if I find something useful. Interested if you've found anything that works for this particular use case.
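For reference, the fallback looks roughly like this; the repetition check and the call_* wrappers are simplified stand-ins, not our production code:
```python
def looks_repetitive(text: str, window: int = 200, min_repeats: int = 3) -> bool:
    """Crude heuristic: the tail of the output appears many times earlier in the text."""
    tail = text[-window:]
    return len(tail) == window and text.count(tail) >= min_repeats

def extract(prompt: str) -> str:
    primary = call_gemini(prompt)          # hypothetical wrapper around the Gemini request
    if not looks_repetitive(primary):
        return primary
    return call_fallback_model(prompt)     # hypothetical wrapper around the fallback model
```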
