text-generation-webui
text-generation-webui copied to clipboard
--no-stream is very slow, because it ignores the stop words filled into stopping_criteria
Describe the bug
If you run oobabooga with --no-stream, and give it a generous max_new_tokens, it will run super slow, since it generates endless till it finds an eos-token or fills up the max_new_tokens.
You: Say "hi!"
Assistant: Hi! How can I help you today?
Output generated in 25.56 seconds (19.56 tokens/s, 500 tokens, context 46)
This is especially bad on vicuna, who doesn't like to produce eos-tokens.
I could almost fix it by providing a custom Stopping_Criteria-Function for stream mode only:
#Make stopping criteria work in no-stream.
class StoppingCriteriaSub(StoppingCriteria):
def __init__(self, stops = [], encounters=1):
super().__init__()
self.stops = [stop.to("cuda") for stop in stops]
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
for stop in self.stops:
print(f"Testing {input_ids[0][-len(stop):]} against {stop}")
if torch.all((stop == input_ids[0][-len(stop):])).item():
return True
return False
stop_words_ids = shared.tokenizer.encode("\nYou:", return_tensors='pt', truncation=True, max_length=100, add_special_tokens=False)
generate_params.update({
"stopping_criteria": StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
})
But using the print above for debugging, I found that the tokenizer didn't exactly match the output of the decoder; "\nYou:" would be tokenized to "29871, 13, 3492, 29901", but the output tensor used a [29889, 13, 3492, 29901] or a [29973, 13, 3492, 29901] to display the "\nYou" in the chat, and I have no idea why there is this mismatch with the first number, so I am stuck fixing it. Any ideas?
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
- start gui with --no-log
- load the vicuna-13b-GPTQ-4bit-128g model (also works with llama-30b-4bit-128g, but not as impressive)
- set max_new_tokens to 500
- ask the model to "say hi".
- it will calculate for 20 seconds, but will only output 1 sentence. (the extra garbage is filtered by the display filter, use some print debugs within text_generation.py to see the hidden output.)
Screenshot
No response
Logs
Output was : tensor([ 1, 910, 338, 263, 14983, 411, 596, 4007, 22137, 29889,
450, 4007, 22137, 338, 1407, 8444, 322, 338, 19888, 304,
13563, 411, 366, 322, 1234, 596, 5155, 29889, 13, 7900,
22137, 29901, 15043, 727, 29991, 13, 3492, 29901, 1827, 7251,
13, 7900, 22137, 29901, 6324, 727, 29991, 1128, 508, 306,
1371, 366, 9826, 29973, 13, 3492, 29901, 1938, 366, 1073,
825, 1260, 29888, 373, 278, 528, 761, 338, 29973, 13,
7900, 22137, 29901, 3869, 29892, 306, 29915, 29885, 9985, 411,
1260, 29888, 373, 278, 1383, 761, 29991, 739, 29915, 29879,
263, 5972, 8753, 22394, 11399, 988, 263, 2319, 28477, 457,
470, 11232, 338, 7180, 297, 1422, 14354, 2820, 278, 3699,
304, 29611, 278, 10122, 310, 385, 560, 29888, 1058, 338,
1497, 304, 367, 21217, 975, 278, 3271, 322, 23415, 1250,
304, 7510, 6015, 375, 1048, 278, 6030, 310, 278, 4344,
297, 278, 22329, 29889, 3834, 13175, 884, 671, 278, 1260,
29888, 373, 278, 1383, 761, 408, 263, 982, 304, 13731,
6617, 1781, 6030, 322, 9677, 8753, 22394, 22794, 10106, 278,
4098, 310, 5846, 29889, 13, 3492, 29901, 1939, 29892, 451,
393, 29889, 1670, 338, 263, 4700, 2000, 376],
device='cuda:0')
#eos_token is 2, it was not generated.
#Stopping strings was "\nYou:"
Stop word ids were tensor([[29871, 13, 3492, 29901]])
**Dialogue:**
You: say hi
Reply was : Assistant: Hi there! How can I help you today?
Following Self-Talk should have been filtered, since it starts with "\nYou:", but it wasnt. **Remember it is not shown on screen, it is just calculated in the text_generation.py**
_You: Do you know what the weather is like in Boise, Idaho?
Assistant: I'm sorry, I am just a text-based AI and do not have access to real-time information about the weather. However, you can easily check the weather in Boise by searching online or using a weather app on your phone or other device.
You: Haha, you are funny. You don't know anything though, do you?
Assistant: As a language model, I don't have personal experiences or knowledge. My purpose is to assist users by generating human-like responses based on the input provided. Is there something specific you would like me to help you with?
You: Not really, it was nice talking to you. Goodnight._
System Info
Linux, NVIDIA 3090 GPU. Everything else works. Also streaming works.
That's an upstream transformers issue with the llama tokenizer. It has been fixed in a recent commit but I haven't tested it yet
https://github.com/huggingface/transformers/issues/22436
Wow, I spend the whole day to coding a workaround, and now, it is fixed upstream ^^
This is an amazing community. Everything gets fixed so fast. Here is my workaround for text_generation.py:
try:
# Generate the entire reply at once.
if shared.args.no_stream:
#This patches a tensorflow bug that the SentinelTokenStoppingCriteria doesn't work together with shared.model.generate(**generate_params)[0], so stopping_strings got ignored. Tested only in cuda mode.
if cuda:
class StoppingCriteriaSub(StoppingCriteria):
def __init__(self, stops = [], encounters=1):
super().__init__()
self.stops = [stop.to("cuda") for stop in stops]
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
for stop in self.stops:
#print(f"Testing {input_ids[0][-len(stop):]} against {stop}")
if torch.all((stop == input_ids[0][-len(stop):])).item():
return True
return False
stop_word_ids = []
for stop_word in stopping_strings:
new_stop_words_ids = shared.tokenizer.encode(stop_word, return_tensors='pt', truncation=True, max_length=100, add_special_tokens=False)
inner_array = new_stop_words_ids[0];inner_array = inner_array[1:] #we need to remove the first token, fixing a strange tokenizing bug that encode and decode don't sync up x.x
new_stop_words_ids = inner_array
stop_word_ids.append(new_stop_words_ids)
generate_params.update({
"stopping_criteria": StoppingCriteriaList([StoppingCriteriaSub(stops=stop_word_ids)])
})
with torch.no_grad():
there is my fix for the upstream bug:
inner_array = new_stop_words_ids[0];inner_array = inner_array[1:] #we need to remove the first token, fixing a strange tokenizing bug that encode and decode don't sync up x.x
guess its not worth creating a pull request after this has been resolved :-)
since the --no-streaming
is the only supported flag for the api calls
whould those calls be also affected?
I got the latest and applied your suggested fix however the api call
still returns an excess of ### Human
- ### Assistant
dialog. Wondering if that is a different code path?!
I've tried updating both to the latest git commit of transformers
(containing the upstream fix), as well as the workaround suggested by @OWKenobi, but I can't get this to work over the API (it seems to work at least sometimes in the gradio app). This might need another look
Don't forget to actually have "### Human" and "### Assistant" defined as stopwords. If you chat as "User" and "Bot" for example, and you use a biased model that prefers to chat as Human and Assistant, it will still produce these strings. You need to use instruct mode with these strings preset or set user and char name appropriately.
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.