
Repeated replies in chat mode with flexgen

Open MetaIX opened this issue 2 years ago • 11 comments

I've successfully loaded opt-iml-max-30b via python server.py --model opt-iml-max-30b --flexgen --compress-weight --cai-chat --percent 100 0 100 0 100 0 on a 4090.
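
For readers unfamiliar with the --percent flag: as the FlexGen README describes it, the six numbers are the percentages of weights, attention (KV) cache, and activations placed on GPU and on CPU, in that order, with any remainder going to disk. The helper below is a hypothetical sketch for reading such a value, not part of text-generation-webui or FlexGen. Under that reading, 100 0 100 0 100 0 keeps everything on the GPU.

```python
from typing import Dict, Sequence


def describe_percent(percent: Sequence[int]) -> Dict[str, Dict[str, int]]:
    """Split a six-number FlexGen-style --percent value into per-tensor placement.

    Assumed order (per the FlexGen README): weights GPU/CPU,
    KV cache GPU/CPU, activations GPU/CPU. The remainder is disk.
    """
    if len(percent) != 6:
        raise ValueError("expected six integers")
    names = ("weights", "kv_cache", "activations")
    placement = {}
    for i, name in enumerate(names):
        gpu, cpu = percent[2 * i], percent[2 * i + 1]
        if gpu < 0 or cpu < 0 or gpu + cpu > 100:
            raise ValueError(f"invalid GPU/CPU split for {name}: {gpu} {cpu}")
        placement[name] = {"gpu": gpu, "cpu": cpu, "disk": 100 - gpu - cpu}
    return placement


# The value used in the command above: everything is placed on the GPU.
for name, split in describe_percent([100, 0, 100, 0, 100, 0]).items():
    print(name, split)
```

With everything on the GPU, the KV cache also grows in VRAM as the chat history gets longer.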

Unfortunately, the bot appears to be repeating the same greeting message after I'm 6 messages in.

Also, trying to switch bots when this happens gives me a seemingly infinite loading time.

No errors appear on the console.

What could be causing this? Happens with multiple bots.

MetaIX avatar Feb 22 '23 08:02 MetaIX

Is there any reason you used opt-iml-max-30b instead of opt-30b? You get nice speed compared to me (I get 0.09 it/s on a 4090...). I can confirm that my current story continues without repetition, although the speed is too atrocious for me to test much. I'll try opt-iml-max-30b and tell you if I get the same speed as you.

Manimap avatar Feb 22 '23 09:02 Manimap

Oh, my speed problem is related to the no-stream config. I'm not sure I see a huge difference between these 2 models, but my already created conversation seems to continue fine.

Manimap avatar Feb 22 '23 11:02 Manimap

I'm just testing behaviors with different models. Honestly, for the 2 messages it managed to answer, it wasn't bad at all. At 0.09 it/s, there must be something wrong. I'll also test opt-30b soon; hopefully I'll get speeds similar to what I have now.

MetaIX avatar Feb 22 '23 11:02 MetaIX

That's interesting. I wonder what could be causing it then.

MetaIX avatar Feb 22 '23 11:02 MetaIX

Yeah, I referenced the problem here: https://github.com/oobabooga/text-generation-webui/issues/105

Manimap avatar Feb 22 '23 11:02 Manimap

This looks like a silent CUDA out of memory error. I will make some experiments with 30b models later and will report my findings.

oobabooga avatar Feb 22 '23 14:02 oobabooga

@MetaIX I just got the same repeated-message problem. I raised the only parameter I could (temperature) and regenerated the text, and it gave me a different message.

Manimap avatar Feb 22 '23 15:02 Manimap

@Ph0rk0z it is true that generation runs all the way to the token limit 100% of the time in flexgen mode. This is really annoying. I seem to be using the stop parameter exactly as in the official FlexGen chatbot example, but it doesn't do anything.

oobabooga avatar Feb 25 '23 21:02 oobabooga

I think that this fixes the issue of FlexGen not stopping the generation at a newline character:

https://github.com/oobabooga/text-generation-webui/commit/6e843a11d64ec0898a1cb6f2cc9a81619038db81
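
To illustrate the general idea of cutting a reply at a stop string, here is a minimal sketch. It is not the code from the commit linked above, and truncate_at_stop is a hypothetical helper name. It trims the output after generation, which is one possible workaround when a backend ignores its stop parameter and always generates up to the token limit.

```python
from typing import Iterable


def truncate_at_stop(text: str, stop_strings: Iterable[str] = ("\n",)) -> str:
    """Return text cut just before the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        if not stop:
            continue  # an empty stop string would match at index 0
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# The model kept generating past the end of its own chat turn.
raw = "Hello there!\nYou: and then the model keeps going"
print(truncate_at_stop(raw))
```

This only hides the extra text. The tokens are still generated, so it does not recover the time spent producing them.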

oobabooga avatar Feb 26 '23 03:02 oobabooga

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.

github-actions[bot] avatar Apr 03 '23 23:04 github-actions[bot]

I successfully loaded opt-30b-iml-max via python server.py --model opt-iml-max-30b --flexgen --compress-weight --cai-chat --percent 100 0 100 0 100 0 on a 4090.

Unfortunately, the bot appears to be repeating the same greeting message after 6 messages.

Also, trying to switch bots when this happens gives me a seemingly infinite loading time.

No errors appear in the console.

What could be causing this? It happens with multiple bots.

mcastillobalderas avatar Apr 04 '23 07:04 mcastillobalderas