Manimap
> I have the same issue. For me it can easily be reproduced right after triggering a CUDA OOM (though it still shows available VRAM, #4541) by simply...
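For context, here is a minimal way to inspect what CUDA actually reports as free VRAM right after an OOM. This is a sketch assuming PyTorch on a CUDA machine, not the webui's own code:

```python
import torch

try:
    # Deliberately over-allocate to provoke an OOM (the size is arbitrary).
    big = torch.empty((1 << 36,), device="cuda")
except torch.cuda.OutOfMemoryError:
    # Ask the CUDA driver what it currently reports as free/total memory.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"free: {free_bytes / 2**30:.2f} GiB, total: {total_bytes / 2**30:.2f} GiB")
```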
Is there any reason you used opt-30b-iml-max instead of opt-30b? I see you get nice speed compared to me (I get 0.09 it/s on a 4090...), but I confirm I...
Oh, my speed problem is related to the no-stream config. I'm not sure I see a huge difference between these two models, but my previously created conversation seems to continue...
Yeah, I referenced the problem here: https://github.com/oobabooga/text-generation-webui/issues/105
@MetaIX I just got the same message thing. I raised the only parameter I could (temperature), regenerated the text, and it gave me another one.
Someone made a fork of the LLaMA GitHub repo that apparently runs in 8-bit: https://github.com/tloen/llama-int8. Zero idea if it works or anything.
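For reference, this is roughly what 8-bit loading looks like through the transformers/bitsandbytes integration. It's a sketch, not the linked fork's API (which I haven't checked), and the model name is just an example; it needs `pip install bitsandbytes accelerate`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"  # example model name, not the fork's weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # spread layers across available devices
    load_in_8bit=True,   # quantize weights to int8 via bitsandbytes
)
```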
Is there any actual benefit in using bfloat16 if the card supports it (Ampere & Lovelace)? Better output? Better speed?
> @Manimap, the [docs](https://huggingface.co/docs/transformers/main_classes/deepspeed#custom-deepspeed-zero-inference) claim it's faster. There's also a caveat for fp16:
>
> > enable bf16 if you own an Ampere or a newer GPU to make things...
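A minimal sketch (assuming PyTorch and transformers; the model name is just an example) of picking bf16 only when the GPU supports it. Ampere and Lovelace cards return True here:

```python
import torch
from transformers import AutoModelForCausalLM

# bf16 keeps fp32's exponent range, which is why the docs warn about fp16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-30b",  # example model
    torch_dtype=dtype,
    device_map="auto",
)
```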
> I think he was suggesting sending a new request to the API after the first one is finished, without waiting for a swipe from the user. So Tavern would...
> This is not how the APIs work. There is a reason why we disable swiping during streamed responses. You cannot make the API send multiple replies to your endpoint...
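In other words, each candidate reply has to be its own request. Here is a hedged sketch of that client-side loop; the endpoint URL and response field are hypothetical placeholders, not the real webui/Tavern API schema:

```python
import requests

API_URL = "http://127.0.0.1:5000/generate"  # hypothetical endpoint

def get_swipes(prompt: str, n: int) -> list[str]:
    """Collect n candidate replies by issuing n sequential requests."""
    replies = []
    for _ in range(n):
        # The server returns exactly one completion per request, so a
        # second candidate ("swipe") requires a second round trip.
        resp = requests.post(API_URL, json={"prompt": prompt}, timeout=120)
        resp.raise_for_status()
        replies.append(resp.json()["text"])  # hypothetical response field
    return replies
```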