text-generation-webui
Flexgen doesn't stop generating at "\n" in chat mode
Hi,
I have a weird problem: when I launch a FlexGen model with the --no-stream option, it absolutely kills performance.
Without --no-stream:
With --no-stream:
The weird thing is that the reported token count matches exactly the max_new_tokens I set (here 200/100/75/200), so it looks like it doesn't know that the answer was fully generated. The token count also doesn't correspond at all to the visible text; it will sometimes report 200 tokens for a single word.
FlexGen doesn't support the fancy stopping_criteria parameter that allows text generation to stop as soon as '\n{your_name}' is generated, but it at least supports stop="\n", allowing the generation to stop at a newline character.
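For reference, this is roughly what such a criterion looks like on the transformers side; a minimal sketch with illustrative names (SentinelStop, "\nYou:") that aren't from the repo:

```python
import transformers

class SentinelStop(transformers.StoppingCriteria):
    def __init__(self, sentinel, tokenizer, prompt_length):
        self.sentinel = sentinel            # e.g. "\nYou:"
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # prompt tokens to skip when decoding

    def __call__(self, input_ids, scores, **kwargs):
        # Decode only the newly generated tokens and look for the sentinel.
        new_text = self.tokenizer.decode(input_ids[0][self.prompt_length:])
        return self.sentinel in new_text

# Passed to model.generate(..., stopping_criteria=
#     transformers.StoppingCriteriaList([SentinelStop("\nYou:", tokenizer, n)]))
```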
I have implemented it as in the chatbot example, but it doesn't seem to be working properly:
https://github.com/oobabooga/text-generation-webui/commit/044b963987eb8871c1d6bc674269891394ef0655
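For anyone following along, the chatbot-example approach boils down to something like this; a hedged sketch, since I haven't verified the exact generate() signature across FlexGen versions, and model / input_ids stand in for the loaded FlexGen model and tokenized prompt:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-13b")

# Token id for a bare newline; add_special_tokens=False avoids picking up
# the BOS token that OPT tokenizers prepend.
stop_token = tokenizer("\n", add_special_tokens=False).input_ids[0]

# Assumption: FlexGen's generate() takes a single stop token id rather than
# a stopping_criteria list, as in its chatbot example.
output_ids = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=200,
    stop=stop_token,
)
```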
Yeah, it still doesn't stop, sadly. Is it actually sending \n? Or maybe \r\n?
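One way to check is to print what the tokenizer actually produces for the two variants; a quick debugging snippet (the model name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-13b")

# If these differ, a stop token derived from "\n" will never match output
# that actually ends in "\r\n".
print(tokenizer("\n", add_special_tokens=False).input_ids)
print(tokenizer("\r\n", add_special_tokens=False).input_ids)
```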
I think that this fixes the problem:
https://github.com/oobabooga/text-generation-webui/commit/6e843a11d64ec0898a1cb6f2cc9a81619038db81
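For illustration, trimming the visible reply at the sentinel looks roughly like this (a hypothetical sketch, not necessarily what the commit does):

```python
def trim_reply(reply: str, sentinel: str = "\nYou:") -> str:
    # Drop everything from the first occurrence of the sentinel onward,
    # so stray "(You)" turns never reach the chat window.
    return reply.split(sentinel, 1)[0]
```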
I no longer get the (You)'s, but it still generates the full amount of tokens.
Some more benchmarks. (For context, the six numbers after --percent are the GPU/CPU splits for weights, attention cache, and activations, in that order.)
13b Compressed: --compress-weight --percent 100 0 100 0 100 0 ALL GPU
Output generated in 40.12 seconds (0.62 it/s, 200 tokens)
Output generated in 13.13 seconds (0.76 it/s, 80 tokens)
Output generated in 14.85 seconds (0.67 it/s, 80 tokens)
13b Compressed: --compress-weight --percent 100 0 0 100 0 100 CPU
Output generated in 24.34 seconds (0.41 it/s, 80 tokens)
Output generated in 27.17 seconds (0.37 it/s, 80 tokens)
Output generated in 35.32 seconds (0.28 it/s, 80 tokens)
And 30B
30b Compressed: --compress-weight --percent 100 0 100 0 100 0 ALL GPU
Output generated in 205.92 seconds (0.05 it/s, 80 tokens)
Output generated in 42.36 seconds (0.24 it/s, 80 tokens)
Output generated in 60.85 seconds (0.16 it/s, 80 tokens)
30b Compressed: --compress-weight --percent 100 0 0 100 0 100 CPU
Output generated in 62.28 seconds (0.16 it/s, 80 tokens)
Output generated in 39.61 seconds (0.25 it/s, 80 tokens)
Output generated in 30.47 seconds (0.33 it/s, 80 tokens)
It says that it generates the whole max_new_tokens count, but that's a bug in the token counter: FlexGen outputs are padded to the maximum length.
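A padding-aware count would look something like this; a minimal sketch, assuming the pad token id is available from the tokenizer (count_real_tokens is a hypothetical helper):

```python
import torch

def count_real_tokens(output_ids: torch.Tensor, pad_token_id: int) -> int:
    # FlexGen pads the output up to the maximum length, so count only the
    # positions that are not the pad token.
    return int((output_ids != pad_token_id).sum())
```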
That explains the variations in the times. Well, at least a 13b is now fully usable.
I believe that this has been solved. Let me know if not.