
DRY: A modern repetition penalty that reliably prevents looping

p-e-w opened this pull request · 24 comments

Looping is an undesirable behavior where the model repeats phrases verbatim that have previously occurred in the input. It affects most models, and is exacerbated by the use of truncation samplers. Chat formats are particularly susceptible due to their regular structure, which models appear to interpret as an invitation to repeat previous messages in whole or in part. Prompting the model to avoid looping has little or no effect.

The traditional weapons for combating looping are the three flavors of repetition penalty built into most loaders (multiplicative, additive, and frequency penalty). But those samplers are rather blunt instruments that distort the grammar of standard language, which the model has been painstakingly trained to reproduce. I have previously attempted to fix this problem by introducing a parameter that protects the basic structure of language from being penalized, but that's a hacky solution that fails to do the right thing in many cases, and even in their raw form, classical repetition penalties don't actually prevent looping reliably.

In the past weeks, I have rethought the looping problem from the ground up, and in this PR present the DRY repetition penalty, a mechanism that is able to detect textual looping and steer against it. It is far superior to the existing samplers at preventing verbatim repetition, while having essentially none of their negative effects on language structure. The result is less repetitive and higher quality output.

I have tested this sampler for about 20 hours in chat scenarios so far, and they have without question been the highest-quality chats I have ever experienced. Looping in the traditional sense simply does not happen with DRY, and the positive effects from being able to drop the standard repetition penalty are very noticeable.

How it works

DRY penalizes tokens that would extend the end of the input into a sequence that has previously occurred in the input.

[Figure: DRY penalty illustration using the "roses are red" example described below]

In this example, violets is penalized in the probability distribution generated by the model because the sequence roses are red has previously occurred in the input, and has been continued with violets in that previous case. Therefore, the penalty discourages the model from repeating sequences in its output, which is the definition of looping.

The penalty for a token is calculated as

multiplier * base ^ (n - allowed_length)

where n is the length of the sequence before that token that matches the end of the input, and multiplier, base, and allowed_length are configurable parameters. If the length of the matching sequence is less than allowed_length, no penalty is applied.
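
For concreteness, here is a minimal sketch of that formula in Python. The default values shown are the ones recommended later in this thread (multiplier 0.8, base 1.75, allowed length 2), not necessarily the loader's built-in defaults, and the function only computes the (additive) penalty value itself:

```python
def dry_penalty(n: int, multiplier: float = 0.8, base: float = 1.75,
                allowed_length: int = 2) -> float:
    """Penalty for a token that would extend a repeated sequence of length n."""
    if n < allowed_length:
        return 0.0
    return multiplier * base ** (n - allowed_length)
```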

Thus the penalty grows exponentially as the repeated sequence gets longer. This will quickly overcome even the strongest tendency of the model to repeat itself. With the right parameter choice, looping is literally impossible with DRY (that is, verbatim textual looping is impossible – the model can of course still repeat itself by paraphrasing and situational looping, but that is far less annoying than the broken-record looping that is common now). All of that happens without affecting non-repeating text in any way.

Sequence breakers

As straightforward as the mechanism described above may appear, it runs into a major problem in practice.

Instruction and chat templates themselves contain lengthy repeating token sequences. For example, with ChatML, the following sequence precedes every message generated by the bot:

\n
<|im_end|> \n
<|im_start|>assistant \n
Bot name: 

That's at least 11 tokens preceding the first token of the message that are guaranteed to have occurred previously in the input. With an exponentially increasing penalty being applied (and we definitely don't want to allow 12-token repetitions in normal text), any given starting token of a bot message can effectively be used only once in the entire chat. That's a huge problem that distorts how chat messages are generated, e.g. when messages are expected to regularly begin with quotation marks.

To solve this and related issues, I have added another parameter, sequence_breakers, which is a list of tokens that interrupt sequence matching. That is, matches are not continued across such tokens, which effectively breaks the input into parts where matching can be applied.

sequence_breakers can be conveniently specified as a JSON array of strings, which will be encoded into token IDs using the loaded model's tokenizer. The default list consists of \n, :, ", and *.
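
As an illustration only (not the PR's actual code), decoding that JSON array and mapping each breaker string to a token ID with a Hugging Face-style tokenizer might look roughly like this; keeping only the last token of each encoding follows the behavior described further down in this thread:

```python
import json

def parse_sequence_breakers(value: str, tokenizer) -> set:
    # value is e.g. '["\\n", ":", "\\"", "*"]' (the default list)
    strings = json.loads(value)
    # Keep the last token ID of each string's encoding.
    return {tokenizer.encode(s, add_special_tokens=False)[-1] for s in strings}
```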

How to use

DRY is disabled by default (multiplier set to 0). It can be configured from the Parameters tab; I recommend the following parameter values:

[Screenshot: recommended DRY parameter values in the Parameters tab]

Note that like all transformers-based samplers, DRY only works with transformers-based loaders such as llamacpp_HF, ExLlamav2_HF, or Transformers itself. It does not work with the vanilla llama.cpp or ExLlamav2 loaders.

If you want the model to regularly repeat certain sequences verbatim (e.g. long character names in chat formats), you can add the individual words comprising those sequences to the sequence_breakers list (for names, just add first and last names there as separate strings). This will prevent DRY from distorting such sequences, and allow them to appear any number of times in the output. If you are building a chat interface that leverages DRY, you could do this automatically for your users as you know the character names already.
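
For example (the character name below is simply the one that appears later in this thread, used purely for illustration), the extended list might look like:

```python
dry_sequence_breakers = ["\n", ":", "\"", "*", "Chiharu", "Yamada"]
```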

Demonstration

To show DRY in action, I have written a short chat script that strongly incentivizes the model to loop:

Detective: Where were you last night at 6 PM?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Did you know the victim personally?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Do you have money problems?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Do you have a criminal record?

Suspect: On the advice of my attorneys, I invoke

Here's how Mistral-7b-Instruct-v0.2 continues with all samplers disabled:

my Fifth Amendment right to not answer that question.

As expected, the model picks up the pattern and repeats itself.

Now let's use a traditional (multiplicative) repetition penalty of 1.3. We get:

my Fifth Amendment right to not answer that question.

Even though 1.3 is a very high value for the repetition penalty that clobbers English grammar, it doesn't stop the model from repeating itself if the structure of the text suggests it so strongly.

Now instead, we use DRY with parameters 2/0.8/1.75 (allowed length 2, multiplier 0.8, base 1.75; the standard parameters recommended above). The model outputs (after some attempts that generate garbage):

secrecy of the grand jury proceedings, which includes my criminal history, if any.

DRY simply does not allow the model to repeat such a long sequence.

Note that this is an extreme test case for demonstration purposes. Combining a strong incentive to loop with a strong penalty for looping will often produce garbage. In practice, using DRY prevents such situations from occurring in the first place, and the output is much more natural.

TODO

  • [X] I have read the Contributing guidelines.
  • [X] More testing (I have rewritten this cleanly from scratch after hacking on the codebase while experimenting, so this version isn't as well tested as what I used previously).
  • [X] Make sure this works over the API.

p-e-w · Mar 10 '24

Some basic comments:

  1. Have you compared how well this works vs the existing no_repeat_ngram_size parameter?
  2. To end a chat turn, the model has to generate something like \nChiharu Yamada: or \nYou:. Is that penalized, such that the model is artificially forced to generate longer replies, or is sequence_breakers enough to prevent this artifact?
  3. repetition_penalty_range should probably be considered in this parameter, just like it is considered in the existing repetition/frequency/presence penalty parameters.

oobabooga · Mar 10 '24

Have you compared how well this works vs the existing no_repeat_ngram_size parameter?

I must admit that although I probably did see that parameter in the Transformers docs at some point in the past, I have never used it and didn't even think of it while developing this.

That being said, no_repeat_ngram_size (which appears to completely forbid all n-gram repetitions over a certain length, and completely allow all below that length) strikes me as something that would produce very unnatural outputs, where suddenly the model slams into a concrete wall where the token it might strongly prefer above all others is hard-disallowed. By contrast, DRY steers the model away from repetition over several successive generation steps, finding the balance point where the model's tendency to repeat is overcome by the penalty. This allows "necessary" repetitions to occur (such as fixed turns of phrase) if the probability distribution is sufficiently skewed, while idle looping is smoothly avoided at an early stage.

no_repeat_ngram_size also appears to lack an equivalent to dry_sequence_breakers, which would make it borderline unusable in practice, just as DRY was before I introduced that parameter.

But now that you have made me (re-)aware of that parameter, I will definitely perform some experiments with it for comparison.

To end a chat turn, the model has to generate something like \nChiharu Yamada: or \nYou:. Is that penalized, such that the model is artificially forced to generate longer replies, or is sequence_breakers enough to prevent this artifact?

It is not penalized. \n is a sequence breaker, so Ch (the first token comprising Chiharu) isn't penalized at all since there is no preceding sequence that could previously occur in the input. The same is true for everything following :, which is also a sequence breaker. Also, sequence breakers themselves are never penalized, so \n etc. can always be freely generated (unlike with standard repetition penalty, which can lead to wall-of-text replies).

The only sequence that matters here is Chiharu Yamada (5 tokens in Mistral). With the standard parameters, aru will receive an additive penalty of 0.8, which shouldn't be a problem, but will grow rapidly from there. With very long names that are expected to be repeated verbatim in the output every time, this can become an issue, and I have noticed it a few times in my testing. This is of course inherent in every repetition penalty system, and I doubt there's an automated way to handle this, especially since the name can occur not only in the label but also in the message itself. In exceptional cases where this becomes enough of an issue to corrupt names, adding the first name to dry_sequence_breakers (which will automatically extract the last token comprising it) should suffice.
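
To make that growth concrete, here is what the penalty formula yields with the recommended values (multiplier 0.8, base 1.75, allowed length 2) as the match length n increases:

```python
for n in range(2, 6):
    print(n, round(0.8 * 1.75 ** (n - 2), 2))
# n=2 -> 0.8, n=3 -> 1.4, n=4 -> 2.45, n=5 -> 4.29
```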

repetition_penalty_range should probably be considered in this parameter, just like it is considered in the existing repetition/frequency/presence penalty parameters.

Not doing that was actually intentional, as I don't believe verbatim repetition of long sequences is ever something the user wants, no matter how far back they occurred previously. But I can of course add it (probably as a separate parameter so it can be controlled independently of the standard repetition penalty, where that parameter makes much more sense to keep small).

p-e-w · Mar 10 '24

Update

  • Added a parameter to control the range over which DRY looks for matching sequences in the input, mirroring the classical repetition penalties.
  • More testing with both chat and creative writing. Confirmed that the recommended parameters work well for both use cases.
  • Confirmed that the parameters work over the API.
  • Did some experiments with no_repeat_ngram_size. As expected, that parameter is unusable for chat formats. Even without template markup, chat logs at minimum need to contain repeating structures like "\n\nName: ", which is already 6 tokens (more if the name is more complex). So to generate well-formed chat output, no_repeat_ngram_size must be at least 7. But that means that such pearls of GPT prose as her voice barely above a whisper cannot be penalized. And I certainly don't want to see such phrases twice in a chat (I don't even want to see them once, but that's not something a sampler can fix :shrug:). By comparison, DRY can easily prevent even shorter phrases from repeating. DRY can also emulate no_repeat_ngram_size by setting dry_multiplier to a huge number, and dry_allowed_length to no_repeat_ngram_size-1, which gives you essentially the original no_repeat_ngram_size plus the benefit of sequence breakers. Overall, I just don't think setting a hard limit on how long repeating sequences may be is the right approach for natural languages.
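
A rough sketch of that emulation, using the parameter names from this PR (the exact magnitude of the multiplier is arbitrary, it just has to be very large):

```python
# Approximate no_repeat_ngram_size = N with DRY:
N = 7
dry_multiplier = 1e9                             # huge: any match of allowed length or more becomes prohibitive
dry_allowed_length = N - 1
dry_base = 1.75                                  # largely irrelevant at such a multiplier
dry_sequence_breakers = ["\n", ":", "\"", "*"]   # the extra benefit over no_repeat_ngram_size
```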

p-e-w · Mar 12 '24

For what it's worth, I've done a lot of experimentation with no_repeat_ngram_size in the past and I can confirm it's fairly useless in a chat context. It might be useful in other contexts, especially in contexts where the input is relatively small. But when a chat message history grows, using no_repeat_ngram_size typically leads to situations where the model intentionally writes broken English (like writing "engglish" instead of "english"), where the brokenness of the language just grows more and more absurd over time. This seems to happen because in many cases (especially with smaller models) the model perceives repetitive output to be extremely likely, so likely that even broken versions of the repetitive output appear more likely than some other alternative continuation of the text. So when we prevent the model from generating the exact same repetitive continuation to the text, it chooses a broken alternative version of the same repetitive text instead of some more natural text.

I do not recommend using no_repeat_ngram_size except at very high values, if no other "circuit breaker" for repetition exists.

I have not tested this PR and I do not know how well this PR works in comparison.

belladoreai · Mar 12 '24

@p-e-w I really like this change. However, one thing I've noticed is that the generation speed decreased as I increased the dry_range, while using the exact same context. Is this something that you've experienced and/or is expected? Could also just be an issue on my end, or maybe even a model specific thing for Yi models.

Hunterius8 · Mar 25 '24

@Hunterius8

Could you quantify that? What is your tokens/s with and without DRY?

On my dev machine, I'm seeing 4.99 tokens/s with DRY and 4.98 tokens/s without it. I'm running Mixtral Q5_K_M with 8192 context size, and dry_range = 0, meaning it goes over the full context window.

For DRY to noticeably impact the generation speed (assuming a baseline of no more than a few dozen tokens/s), the invocation would have to take tens of milliseconds. The matching operation starts with

match_indices = (input_ids_row[:-1] == last_token).nonzero()

which I believe should be GPU-accelerated. Afterwards, the number of tokens that must be checked "manually" is reduced dramatically, and should be in the low hundreds at most (often much less), which should take less than a millisecond. Not sure what's going on in your case yet.
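
For reference, a rough, unoptimized sketch of the matching procedure as described in this thread (an illustration, not the PR's actual implementation):

```python
import torch

def apply_dry(input_ids: torch.Tensor, logits: torch.Tensor,
              multiplier: float, base: float, allowed_length: int,
              breaker_ids: set) -> torch.Tensor:
    """input_ids: 1-D tensor of context token IDs; logits: 1-D tensor over the vocabulary."""
    last_token = input_ids[-1].item()
    if last_token in breaker_ids:
        return logits
    # Positions (excluding the final one) where the last input token occurred before.
    match_indices = (input_ids[:-1] == last_token).nonzero().flatten()
    match_lengths = {}
    for idx in match_indices.tolist():
        next_token = input_ids[idx + 1].item()
        if next_token in breaker_ids:
            continue  # sequence breakers themselves are never penalized
        # Walk backwards to see how far the match with the end of the input extends,
        # without crossing a sequence breaker.
        length = 1
        while (idx - length >= 0
               and input_ids[idx - length].item() not in breaker_ids
               and input_ids[idx - length] == input_ids[-1 - length]):
            length += 1
        match_lengths[next_token] = max(length, match_lengths.get(next_token, 0))
    # Penalize each candidate continuation token according to its longest match.
    for token, n in match_lengths.items():
        if n >= allowed_length:
            logits[token] -= multiplier * base ** (n - allowed_length)
    return logits
```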

p-e-w · Mar 29 '24

@p-e-w

Yeah, I ran through a few generations again; here's the tokens/s with every sampler turned off: [screenshot]

Then for the next one I turned on just DRY and set the range to 2048: [screenshot]

And for the last one I set the DRY range to 0: [screenshot]

On my end at least, it seems to have a pretty big impact on the generation speed. I'm wondering if it isn't because I'm using an exl2 quant of the yi-34b-200k model, so I'll retry with a gguf model later.

Update: Got the same results for the gguf models I tested. The issue also persisted on a completely fresh install.

Hunterius8 · Mar 30 '24

I think you can write an academic paper about it.

Touch-Night · Apr 05 '24

@Hunterius8

I see, that's a lot more context than I've ever run, combined with a pretty high base performance, so this is probably the reason I don't notice it in my own setup.

That being said, I'm not sure what can be done about it, because I don't think the algorithm can be vectorized the way other samplers are. This isn't really a bug, it's just how long it takes to do that thing. If someone has a magic fix to make it faster then I'm all ears. Personally, I would run DRY even if it cost me half the performance, because the output is so much better. But it's disabled by default so everyone can make their own choice.

p-e-w · Apr 06 '24

I'm not sure what can be done about it, because I don't think the algorithm can be vectorized the way other samplers are. This isn't really a bug, it's just how long it takes to do that thing. If someone has a magic fix to make it faster then I'm all ears.

The algorithm doesn't have to be vectorized; it can (most likely) be optimized in other ways, by reducing the asymptotic time complexity.

That said, 19k context is massive, and if the sampler currently slows the generation only by 50% at such a huge context, then I don't think it's worth it to add complexity to the codebase by optimizing the algorithm.

And all that said, if @oobabooga feels that it should be optimized for performance, I should be able to help with this.

belladoreai · Apr 06 '24

Honestly, all that's needed is a warning that performance will be lower. This thing is crucial; we need it ASAP. Also, is it possible to add something like a vocabulary of phrases and words that we want penalized right off the bat?

Priestru · Apr 09 '24

I have made the following changes:

  • Make it a LogitsProcessor like other repetition penalties
  • Reuse the repetition_penalty_range parameter (I don't want to add a new parameter that does the same thing, and there is no reason to use more than 1 type of repetition penalty at the same time)
  • Minor UI changes

My remaining concerns are two:

  1. The dry_sequence_breakers format, as commented above
  2. About the base and multiplier parameters, is base really needed? Is there a reason not to hardcode it at 1.75 and leave only multiplier, for simplicity and fewer parameters?

oobabooga · Apr 11 '24

Also, any tests on whether things still work as expected after my changes are welcome.

oobabooga · Apr 11 '24

Make it a LogitsProcessor like other repetition penalties

That means losing control over DRY's position in the sampler stack, right? I think it can be valuable to be able to choose when the penalty is applied (that goes for the traditional repetition penalty as well).

The most important thing is that the DRY penalty is applied before any truncation samplers. Is that still guaranteed to be true if it is a LogitsProcessor?

Reuse the repetition_penalty_range parameter (I don't want to add a new parameter that does the same thing, and there is no reason to use more than 1 type of repetition penalty at the same time)

Actually, I sometimes combine DRY with a very small standard repetition penalty such as 1.03 nowadays, to curb the tendency of some models to frequently use the same terms. Taking into account the performance impact noted by @Hunterius8, this does provide some justification for keeping the parameters separate.

Minor UI changes

:+1: Agreed, this order makes more sense. The parameter that controls whether DRY is active now comes first.

About the base and multiplier parameters, is base really needed? Is there a reason not to hardcode it at 1.75 and leave only multiplier, for simplicity and fewer parameters?

I really dislike hardcoding magic values. The recommended value of 1.75 is the result of some experimentation, and I have used values between 1.2 and 3.0 with some success. Considering how sensitive the growth of the penalty is to this parameter, I would prefer to keep it.

Also, any tests on whether things still work as expected after my changes are welcome.

I will run the branch with your changes for a few days and then let you know if there are any problems.

p-e-w · Apr 13 '24

@Priestru

Also is it possible to add smth like a vocabulary of phrases and words that we want to have penalized right off the bat?

I plan to implement exactly that in a followup. My long-term vision is to have community-maintained phrasebooks of things like fanfiction clichés (her tears were clinging to her eyelashes like morning dew etc.) that people can select in a frontend like SillyTavern, which will then be passed to DRY in order to prevent such garbage from ever appearing in the output.

p-e-w · Apr 13 '24

I would agree that there's merit to having separate range parameters for DRY and the regular repetition penalties, not just for performance reasons, but also because I believe that those two parameters have very different sweet spots when both are used at the same time. From my experimentation, using a low presence penalty with a range of about 1000 tokens in conjunction with a much higher DRY range, somewhere around 8000 tokens, works really well on Yi models, for example.

Just using DRY, there's no way to penalize the repetition of the tokens that follow the DRY sequence breakers. Applying any of the regular repetition penalties over the same range that works really well for DRY will probably penalize too many tokens and hurt the output quality.

Hunterius8 · Apr 14 '24

Why is this thing stuck, never implemented, when it worked AMAZINGLY well a month ago? Every time I read "barely above a whisper" I log in here to check whether this has been added, and it never has. Manipulating the code to make this work locally becomes increasingly hard, especially since no new commits have been added to this pull request. I have to choose between other updates and this. @p-e-w, is there any chance your magnificent creation could find its way into production? Maybe something else supports it?

Priestru · Apr 22 '24

@l3utterfly is porting DRY to llama.cpp: https://github.com/ggerganov/llama.cpp/pull/6839

p-e-w · Apr 24 '24

@oobabooga

Could you give me a hint on how to proceed here? Do you plan to merge this PR? If so, what are the remaining steps?

p-e-w · Apr 26 '24

I would also like to know. I've been watching this PR for a while, and I hope it solves many of the repetition issues I am having even with the current options available in the UI.

YakuzaSuske · Apr 26 '24

I'm waiting for the dry_sequence_breakers syntax change requested above to merge this PR.

oobabooga · May 11 '24

@oobabooga

Could you reply to the concern I've raised above regarding that syntax? This will define the API, and I'm struggling to see a clean way for clients to build the proposed syntax. In fact, the easiest method seems to be to first serialize to JSON, then trim off the brackets, which strongly suggests that it should simply be a JSON array to begin with.

See also my comment regarding sampler order. DRY must come before any truncation samplers. Is that still guaranteed with it being a LogitsProcessor?

p-e-w · May 12 '24

It's inconvenient to tell people to write a [ and then a ] in the UI. Asking for "strings written between "" and separated by commas" is weird enough.

The parameter can still be written this way in the UI and optionally as a JSON array in the API, similar to how the sampler order parameter is implemented.
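
For illustration, one way such dual parsing could look (a sketch under the assumption that the bare UI form is simply the JSON array without its brackets; not necessarily what the PR ends up doing):

```python
import json

def parse_breakers(value: str):
    value = value.strip()
    if not value:
        return []
    if not value.startswith("["):
        value = "[" + value + "]"  # wrap the bare comma-separated form into a JSON array
    return json.loads(value)

# Both forms yield the same list of breaker strings:
# parse_breakers('"\\n", ":", "*"')     -> ['\n', ':', '*']
# parse_breakers('["\\n", ":", "*"]')   -> ['\n', ':', '*']
```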

DRY must come before any truncation samplers. Is that still guaranteed with it being a LogitsProcessor?

Yes, it is. Regarding the sampler order of repetition penalties, that's a different subject (it's not configurable even for the current repetition penalties), and it can be handled in a separate PR. Note that LogitsProcessors (which include grammar) are used even in Contrastive Search, which does not use any of the logits warper parameters. The same goes for Beam Search, but I never got that to work properly with transformers.

oobabooga · May 12 '24

@oobabooga

PR updated!

  • dry_sequence_breakers can now be specified either as a comma-separated list of quoted strings, or as a JSON array. This works both in the UI and over the API. I believe this is the most flexible solution.
  • Documented the parameters in the UI.
  • Fixed a small cosmetic issue left over from the previous refactor.

p-e-w · May 14 '24

Thanks for the updates @p-e-w, the PR looks good to merge now. I didn't have time to review it earlier.

I have made a brief test by providing a character card with the same input over and over again with the multiplier set to 0.8, and there was zero repetition after some 10 attempts. This seems like a breakthrough in making conversations more natural.

The downside, as others have noted, is the big performance hit. If anyone has an idea how to make the operations faster (@p-e-w, @belladoreai), please feel free to submit a new PR.

oobabooga · May 20 '24

I fixed the performance issues in this PR: https://github.com/oobabooga/text-generation-webui/pull/6047

belladoreai · May 23 '24

DRY seems to be unable to take effect through the API. I used pew's fork before, and DRY could be applied through the API in SillyTavern. However, the current main merged version does not seem to be able to take effect through the API. I tried belladoreai's fork, but it does not seem to take effect either.

If I ask the AI to repeat "Okay" 15 times in TGW, it will stop after 4~5 times, or produce something like this: [screenshot]

But when I ask in ST, I get this: [screenshot]

yamosin · May 30 '24

the current main merged version does not seem to be able to take effect through the API

Clarifying the current branch situation:

  • The main branch does not have any version of DRY
  • pew's fork has been merged into the dev branch
  • My changes are in #6053 and have not been merged anywhere yet

I haven't tested DRY via API. If there was some issue before my changes, then that issue still remains. But if DRY was working correctly via API in the dev branch, then my changes in #6053 will not break it.

@yamosin I would be happy to troubleshoot this further; could you grab the API request that SillyTavern sends to TGW and paste it here? I'm especially interested in how you are defining the DRY parameters inside SillyTavern.

belladoreai · May 30 '24

@belladoreai

Sorry, I used the wrong word; I meant the dev version. I downloaded your fork directly from https://github.com/belladoreai/text-generation-webui/tree/dev-dry-optimization2. I don't define DRY parameters in ST; it could use the TGW DRY parameters before, I guess? Since tokens/s degraded, I'm just testing it to make sure it works, not really using it. If I set a high dry_multiplier value in TGW, ST gets a broken reply, and when I set it back to normal the reply is normal again, so I think that means DRY worked. Seeing your fork got me interested, but then I found it wouldn't take effect via the API.

yamosin · May 30 '24

I don't define DRY parameters in ST; it could use the TGW DRY parameters before, I guess?

The default value for dry_multiplier is 0, so if you don't set the parameter to some other value, then DRY will not be used.

Note that since you are testing via the API, you need to pass the parameter in the API call. If you change parameters in the UI, those affect only UI generations, not API.
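
As a concrete example, a hypothetical API call might look like the sketch below. The endpoint path and port are the webui's usual OpenAI-compatible defaults and should be checked against your setup; the DRY parameter names are the ones discussed in this PR, and the exact accepted format for dry_sequence_breakers may differ:

```python
import requests

payload = {
    "prompt": "Detective: Where were you last night at 6 PM?\n\nSuspect:",
    "max_tokens": 200,
    "dry_multiplier": 0.8,       # must be non-zero, otherwise DRY stays disabled
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_sequence_breakers": '["\\n", ":", "\\"", "*"]',
}
response = requests.post("http://127.0.0.1:5000/v1/completions", json=payload)
print(response.json()["choices"][0]["text"])
```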

You mentioned that some old version of DRY used to work for you over the API. I am guessing that the default parameter in that old version was different, and now it no longer works because the default parameters have been changed?

belladoreai · May 30 '24