Manimap
> I have the same issue. For me it can easily be reproduced right after triggering a CUDA OOM (though it still shows available VRAM, #4541) by simply...
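For context, here is a minimal way to inspect what CUDA actually reports as free VRAM right after an OOM. This is a sketch assuming PyTorch on a CUDA machine, not the webui's own code:

```python
import torch

try:
    # Deliberately over-allocate to provoke an OOM (the size is arbitrary).
    big = torch.empty((1 << 36,), device="cuda")
except torch.cuda.OutOfMemoryError:
    # Ask the CUDA driver what it currently reports as free/total memory.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"free: {free_bytes / 2**30:.2f} GiB, total: {total_bytes / 2**30:.2f} GiB")
```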
Is there any reason you used opt-30b-iml-max instead of opt-30b? I see you get nice speed compared to me (I get 0.09 it/s on a 4090...), but I confirm I...
Oh, my speed problem is related to the no-stream config. I'm not sure I see a huge difference between these two models, but my previously created conversation seems to continue...
Yeah, I referenced the problem here: https://github.com/oobabooga/text-generation-webui/issues/105
@MetaIX I just got the same message thing. I raised the only parameter I could (temperature), regenerated the text, and it gave me another one.
Someone made a fork of the LLaMA GitHub repo that apparently runs in 8-bit: https://github.com/tloen/llama-int8. Zero idea if it works or anything.
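For reference, this is roughly what 8-bit loading looks like through the transformers/bitsandbytes integration. It's a sketch, not the linked fork's API (which I haven't checked), and the model name is just an example; it needs `pip install bitsandbytes accelerate`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"  # example model name, not the fork's weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # spread layers across available devices
    load_in_8bit=True,   # quantize weights to int8 via bitsandbytes
)
```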
Is there any actual benefit in using bfloat16 if the card supports it (Ampere & Lovelace)? Better output? Better speed?
> @Manimap, the [docs](https://huggingface.co/docs/transformers/main_classes/deepspeed#custom-deepspeed-zero-inference) claim it's faster. There's also a caveat for fp16:
>
> > enable bf16 if you own an Ampere or a newer GPU to make things...
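A minimal sketch (assuming PyTorch and transformers; the model name is just an example) of picking bf16 only when the GPU supports it. Ampere and Lovelace cards return True here:

```python
import torch
from transformers import AutoModelForCausalLM

# bf16 keeps fp32's exponent range, which is why the docs warn about fp16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-30b",  # example model
    torch_dtype=dtype,
    device_map="auto",
)
```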
> I think he was suggesting sending a new request to the API after the first one is finished, without waiting for a swipe from the user. So Tavern would...
> This is not how the APIs work. There is a reason why we disable swiping during streamed responses. You cannot make the API send multiple replies to your endpoint...
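In other words, each candidate reply has to be its own request. Here is a hedged sketch of that client-side loop; the endpoint URL and response field are hypothetical placeholders, not the real webui/Tavern API schema:

```python
import requests

API_URL = "http://127.0.0.1:5000/generate"  # hypothetical endpoint

def get_swipes(prompt: str, n: int) -> list[str]:
    """Collect n candidate replies by issuing n sequential requests."""
    replies = []
    for _ in range(n):
        # The server returns exactly one completion per request, so a
        # second candidate ("swipe") requires a second round trip.
        resp = requests.post(API_URL, json={"prompt": prompt}, timeout=120)
        resp.raise_for_status()
        replies.append(resp.json()["text"])  # hypothetical response field
    return replies
```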