
Flexgen doesn't stop generating at "\n" in chat mode

Open Manimap opened this issue 2 years ago • 6 comments

Hi,

I have a weird problem: when I launch a FlexGen model using the no-stream option, it absolutely kills performance. Without no-stream: [screenshot]

With no-stream: [screenshot]

The weird thing is that the result matches exactly the max_new_tokens value I set (here I set it to 200/100/75/200), so it looks like it doesn't know the answer was fully generated or something? The number of tokens reported also doesn't correspond at all to the visible text; sometimes it reports 200 for just one word.

Manimap avatar Feb 22 '23 11:02 Manimap

FlexGen doesn't support the fancy stopping_criteria parameter that allows the text generation to stop as soon as

'\n{your_name}'

is generated, but it at least supports stop="\n", allowing the generation to stop at a newline character.

I have implemented it as in the chatbot example, but it doesn't seem to be working properly:

https://github.com/oobabooga/text-generation-webui/commit/044b963987eb8871c1d6bc674269891394ef0655
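For reference, the intended behavior can be sketched without a real model. This is a minimal illustration, not FlexGen's actual API: `fake_generate_steps` stands in for a model emitting one token per step, and the loop checks the decoded text for a newline instead of padding out to max_new_tokens.

```python
def fake_generate_steps(tokens):
    """Stand-in for a model: yield the running output one token at a time."""
    text = ""
    for t in tokens:
        text += t
        yield text

def generate_until_newline(tokens, max_new_tokens=200):
    """Stop as soon as '\\n' appears, instead of always running to max_new_tokens."""
    out = ""
    for i, text in enumerate(fake_generate_steps(tokens)):
        out = text
        if "\n" in out or i + 1 >= max_new_tokens:
            break
    # Discard anything after the newline (e.g. the next speaker's prompt)
    return out.split("\n")[0]

print(generate_until_newline(["Hello", " world", "\n", "You:", " hi"]))
```

With the early stop in place, the "You:" continuation never reaches the chat output.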

oobabooga avatar Feb 22 '23 14:02 oobabooga

Yeah, it doesn't stop even now, sadly. Is it actually sending \n? Or maybe \r\n?
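One quick way to rule out the \r\n case: normalize line endings before the stop check. A minimal sketch (the function name is illustrative, not from the codebase):

```python
def normalize_newlines(text):
    # Windows-style line endings would defeat a plain '\n' stop check;
    # collapsing '\r\n' to '\n' first makes the check reliable either way.
    return text.replace("\r\n", "\n")

print(repr(normalize_newlines("Hi there\r\nYou:")))
```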

Manimap avatar Feb 22 '23 14:02 Manimap

I think that this fixes the problem:

https://github.com/oobabooga/text-generation-webui/commit/6e843a11d64ec0898a1cb6f2cc9a81619038db81

oobabooga avatar Feb 26 '23 03:02 oobabooga

I get no more (You)'s but it still generates the full amount.

Some more benchmarks.

13b Compressed: --compress-weight --percent 100 0 100 0 100 0 ALL GPU

Output generated in 40.12 seconds (0.62 it/s, 200 tokens)
Output generated in 13.13 seconds (0.76 it/s, 80 tokens)
Output generated in 14.85 seconds (0.67 it/s, 80 tokens)
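For readers unfamiliar with the flag: the six numbers after --percent are placement percentages, in FlexGen's documented order (weights GPU/CPU, KV cache GPU/CPU, activations GPU/CPU); verify the order against your FlexGen version. A small sketch of that mapping:

```python
# Field names follow FlexGen's documented --percent order (assumed here):
FIELDS = ["weights_gpu", "weights_cpu",
          "cache_gpu", "cache_cpu",
          "act_gpu", "act_cpu"]

def parse_percent(values):
    """Map the six --percent numbers to named placement percentages."""
    return dict(zip(FIELDS, values))

# "100 0 100 0 100 0" = everything on GPU;
# "100 0 0 100 0 100" = weights on GPU, KV cache and activations in CPU RAM.
print(parse_percent([100, 0, 0, 100, 0, 100]))
```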

13b Compressed: --compress-weight --percent 100 0 0 100 0 100 CPU

Output generated in 24.34 seconds (0.41 it/s, 80 tokens)
Output generated in 27.17 seconds (0.37 it/s, 80 tokens)
Output generated in 35.32 seconds (0.28 it/s, 80 tokens)

And 30B

30b Compressed: --compress-weight --percent 100 0 100 0 100 0 ALL GPU

Output generated in 205.92 seconds (0.05 it/s, 80 tokens)
Output generated in 42.36 seconds (0.24 it/s, 80 tokens)
Output generated in 60.85 seconds (0.16 it/s, 80 tokens)

30b Compressed: --compress-weight --percent 100 0 0 100 0 100 CPU

Output generated in 62.28 seconds (0.16 it/s, 80 tokens)
Output generated in 39.61 seconds (0.25 it/s, 80 tokens)
Output generated in 30.47 seconds (0.33 it/s, 80 tokens)

Ph0rk0z avatar Feb 26 '23 18:02 Ph0rk0z

It says that it generates the whole max_new_tokens count, but that's a bug in the token counter: FlexGen outputs are padded to the maximum length.
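A padding-aware count would fix the stat. A minimal sketch (the pad id here is illustrative; the real value depends on the tokenizer):

```python
PAD_ID = 1  # illustrative pad token id, not the actual tokenizer's value

def count_real_tokens(output_ids, pad_id=PAD_ID):
    """Count generated tokens excluding padding, so it/s reflects real work."""
    return sum(1 for t in output_ids if t != pad_id)

ids = [5, 9, 12, 1, 1, 1]  # 3 real tokens followed by 3 pad tokens
print(count_real_tokens(ids))  # -> 3
```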

oobabooga avatar Feb 26 '23 19:02 oobabooga

That explains the variation in the times. Well, at least a 13b is now fully usable.

Ph0rk0z avatar Feb 27 '23 13:02 Ph0rk0z

I believe that this has been solved, let me know if not

oobabooga avatar Mar 29 '23 03:03 oobabooga