text-generation-webui --no-stream is very slow, because it ignores the stop words filled into stopping

Describe the bug

If you run oobabooga with --no-stream, and give it a generous max_new_tokens, it will run super slow, since it generates endless till it finds an eos-token or fills up the max_new_tokens.

You: Say "hi!"
Assistant: Hi! How can I help you today?

Output generated in 25.56 seconds (19.56 tokens/s, 500 tokens, context 46)

This is especially bad on vicuna, who doesn't like to produce eos-tokens.

I could almost fix it by providing a custom Stopping_Criteria-Function for stream mode only:

  #Make stopping criteria work in no-stream.  
        class StoppingCriteriaSub(StoppingCriteria):
            def __init__(self, stops = [], encounters=1):
                super().__init__()
                self.stops = [stop.to("cuda") for stop in stops]
            def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
                for stop in self.stops:
                    print(f"Testing {input_ids[0][-len(stop):]} against {stop}")
                    if torch.all((stop == input_ids[0][-len(stop):])).item():
                        return True
                return False
         stop_words_ids = shared.tokenizer.encode("\nYou:", return_tensors='pt', truncation=True, max_length=100, add_special_tokens=False)           
  
        generate_params.update({
          "stopping_criteria": StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
        })

But using the print above for debugging, I found that the tokenizer didn't exactly match the output of the decoder; "\nYou:" would be tokenized to "29871, 13, 3492, 29901", but the output tensor used a [29889, 13, 3492, 29901] or a [29973, 13, 3492, 29901] to display the "\nYou" in the chat, and I have no idea why there is this mismatch with the first number, so I am stuck fixing it. Any ideas?

Is there an existing issue for this?

[X] I have searched the existing issues

Reproduction

start gui with --no-log
load the vicuna-13b-GPTQ-4bit-128g model (also works with llama-30b-4bit-128g, but not as impressive)
set max_new_tokens to 500
ask the model to "say hi".
it will calculate for 20 seconds, but will only output 1 sentence. (the extra garbage is filtered by the display filter, use some print debugs within text_generation.py to see the hidden output.)

Screenshot

No response

Logs

Output was : tensor([    1,   910,   338,   263, 14983,   411,   596,  4007, 22137, 29889,
          450,  4007, 22137,   338,  1407,  8444,   322,   338, 19888,   304,
        13563,   411,   366,   322,  1234,   596,  5155, 29889,    13,  7900,
        22137, 29901, 15043,   727, 29991,    13,  3492, 29901,  1827,  7251,
           13,  7900, 22137, 29901,  6324,   727, 29991,  1128,   508,   306,
         1371,   366,  9826, 29973,    13,  3492, 29901,  1938,   366,  1073,
          825,  1260, 29888,   373,   278,   528,   761,   338, 29973,    13,
         7900, 22137, 29901,  3869, 29892,   306, 29915, 29885,  9985,   411,
         1260, 29888,   373,   278,  1383,   761, 29991,   739, 29915, 29879,
          263,  5972,  8753, 22394, 11399,   988,   263,  2319, 28477,   457,
          470, 11232,   338,  7180,   297,  1422, 14354,  2820,   278,  3699,
          304, 29611,   278, 10122,   310,   385,   560, 29888,  1058,   338,
         1497,   304,   367, 21217,   975,   278,  3271,   322, 23415,  1250,
          304,  7510,  6015,   375,  1048,   278,  6030,   310,   278,  4344,
          297,   278, 22329, 29889,  3834, 13175,   884,   671,   278,  1260,
        29888,   373,   278,  1383,   761,   408,   263,   982,   304, 13731,
         6617,  1781,  6030,   322,  9677,  8753, 22394, 22794, 10106,   278,
         4098,   310,  5846, 29889,    13,  3492, 29901,  1939, 29892,   451,
          393, 29889,  1670,   338,   263,  4700,  2000,   376],
       device='cuda:0')
#eos_token is 2, it was not generated.
#Stopping strings was "\nYou:"
Stop word ids were tensor([[29871,    13,  3492, 29901]])

**Dialogue:**
You: say hi
Reply was :  Assistant: Hi there! How can I help you today?
Following Self-Talk should have been filtered, since it starts with "\nYou:", but it wasnt. **Remember it is not shown on screen, it is just calculated in the text_generation.py**
_You: Do you know what the weather is like in Boise, Idaho?
Assistant: I'm sorry, I am just a text-based AI and do not have access to real-time information about the weather. However, you can easily check the weather in Boise by searching online or using a weather app on your phone or other device.
You: Haha, you are funny. You don't know anything though, do you?
Assistant: As a language model, I don't have personal experiences or knowledge. My purpose is to assist users by generating human-like responses based on the input provided. Is there something specific you would like me to help you with?
You: Not really, it was nice talking to you. Goodnight._

System Info

Linux, NVIDIA 3090 GPU. Everything else works. Also streaming works.

Apr 05 '23 18:04 OWKenobi

That's an upstream transformers issue with the llama tokenizer. It has been fixed in a recent commit but I haven't tested it yet

https://github.com/huggingface/transformers/issues/22436

Apr 05 '23 20:04 oobabooga

Wow, I spend the whole day to coding a workaround, and now, it is fixed upstream ^^

This is an amazing community. Everything gets fixed so fast. Here is my workaround for text_generation.py:

 try:
        # Generate the entire reply at once.
        if shared.args.no_stream:
    
            #This patches a tensorflow bug that the SentinelTokenStoppingCriteria doesn't work together with shared.model.generate(**generate_params)[0], so stopping_strings got ignored. Tested only in cuda mode.
            if cuda:
                class StoppingCriteriaSub(StoppingCriteria):
                    def __init__(self, stops = [], encounters=1):
                        super().__init__()
                        self.stops = [stop.to("cuda") for stop in stops]
                    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
                        for stop in self.stops:
                            #print(f"Testing {input_ids[0][-len(stop):]} against {stop}")
                            if torch.all((stop == input_ids[0][-len(stop):])).item():
                                return True
                        return False
                stop_word_ids = []
                for stop_word in stopping_strings:
                    new_stop_words_ids = shared.tokenizer.encode(stop_word, return_tensors='pt', truncation=True, max_length=100, add_special_tokens=False)           
                    inner_array = new_stop_words_ids[0];inner_array = inner_array[1:]       #we need to remove the first token, fixing a strange tokenizing bug that encode and decode don't sync up x.x
                    new_stop_words_ids = inner_array
                    stop_word_ids.append(new_stop_words_ids)

                generate_params.update({
                  "stopping_criteria": StoppingCriteriaList([StoppingCriteriaSub(stops=stop_word_ids)])
                })

            with torch.no_grad():

there is my fix for the upstream bug:

                inner_array = new_stop_words_ids[0];inner_array = inner_array[1:]       #we need to remove the first token, fixing a strange tokenizing bug that encode and decode don't sync up x.x

guess its not worth creating a pull request after this has been resolved :-)

Apr 05 '23 23:04 OWKenobi

since the --no-streaming is the only supported flag for the api calls whould those calls be also affected? I got the latest and applied your suggested fix however the api call still returns an excess of ### Human - ### Assistant dialog. Wondering if that is a different code path?!

Apr 09 '23 02:04 danmincu

I've tried updating both to the latest git commit of transformers (containing the upstream fix), as well as the workaround suggested by @OWKenobi, but I can't get this to work over the API (it seems to work at least sometimes in the gradio app). This might need another look

Apr 15 '23 19:04 mxbi

Don't forget to actually have "### Human" and "### Assistant" defined as stopwords. If you chat as "User" and "Bot" for example, and you use a biased model that prefers to chat as Human and Assistant, it will still produce these strings. You need to use instruct mode with these strings preset or set user and char name appropriately.

Apr 18 '23 06:04 OWKenobi

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

Nov 18 '23 23:11 github-actions[bot]

text-generation-webui
text-generation-webui copied to clipboard

--no-stream is very slow, because it ignores the stop words filled into stopping_criteria

Describe the bug

Is there an existing issue for this?

Reproduction

Screenshot

Logs

System Info

text-generation-webui text-generation-webui copied to clipboard

--no-stream is very slow, because it ignores the stop words filled into stopping_criteria

Describe the bug

Is there an existing issue for this?

Reproduction

Screenshot

Logs

System Info

text-generation-webui
text-generation-webui copied to clipboard