@mazzzystar Not sure on that, but it definitely seems wrong. My `.bin` file is 26M for a 13B model, so I'd expect the 7B to be a few...
I'd love to see some separation, or even the option to not run the model with this repo at all and instead just use the SvelteKit app + Mongo with an API...
:+1: Would love to see support for replit 1.3B
@maozdemir Could you look into merging this PR, please? I'd love to try privateGPT, but I want to use it with text-generation-webui instead of llama.cpp.
+1 for LoRA support (ideally 4-bit LoRA support please, i.e. the output from https://github.com/johnsmith0031/alpaca_lora_4bit !)
Yep, 4-bit inference with bitsandbytes is super slow. GPTQ is pretty fast though; on my hardware it's actually faster than fp16 inference. There's a high-level library called AutoGPTQ (https://github.com/PanQiWei/AutoGPTQ)...
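For anyone who wants to try it, here's a minimal sketch of loading a pre-quantized GPTQ checkpoint with AutoGPTQ; the checkpoint path and generation settings are placeholders, not something from this thread:

```python
# Minimal sketch: 4-bit GPTQ inference via AutoGPTQ.
# The checkpoint path is a placeholder; substitute any pre-quantized GPTQ model.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "path/to/13B-GPTQ-4bit"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,  # most published GPTQ checkpoints ship safetensors
    use_triton=False,      # plain CUDA kernels; the Triton backend is optional
)

inputs = tokenizer("Explain GPTQ in one sentence.", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```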
`min_tokens` is an essential missing feature IMO. Would be great to get this merged
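For context, `min_tokens` usually just means masking the EOS logit until a minimum number of new tokens has been generated. A rough illustrative sketch using a HF-style logits processor (not the API of this project, names made up for the example):

```python
# Illustrative sketch of a `min_tokens` constraint: forbid EOS until at least
# `min_tokens` new tokens have been generated.
import torch
from transformers import LogitsProcessor

class MinTokensLogitsProcessor(LogitsProcessor):
    def __init__(self, prompt_len: int, min_tokens: int, eos_token_id: int):
        self.prompt_len = prompt_len
        self.min_tokens = min_tokens
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        generated = input_ids.shape[-1] - self.prompt_len
        if generated < self.min_tokens:
            scores[:, self.eos_token_id] = float("-inf")  # can't end the sequence yet
        return scores
```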
Yes please, this would be great. It would allow loading up to 30B models on a consumer GPU with 24GB of VRAM. Also, instead of using GPTQ-for-LLaMa, please use AutoGPTQ ( https://github.com/PanQiWei/AutoGPTQ...
Same here. I'm seeing 20+ tok/s on a 13B model with GPTQ-for-LLaMa/AutoGPTQ and 3-4 tok/s with exllama on my P40. There is a flag for gptq/torch called `use_cuda_fp16 = False`...
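In case it helps, a sketch of where that flag goes when loading through AutoGPTQ (the path is a placeholder and I'm assuming `from_quantized` is the loader being used):

```python
# Sketch: disable fp16 CUDA matmuls for cards with weak fp16 (e.g. the P40).
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/13B-GPTQ-4bit",  # placeholder checkpoint
    device="cuda:0",
    use_triton=False,
    use_cuda_fp16=False,  # fall back to fp32 kernels; helps on Pascal cards
)
```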
@TimyIsCool As mentioned above by @turboderp, the poor fp16 performance on the P40 means exllama is going to be slow. Try the AutoGPTQ/GPTQ-for-LLaMa loaders instead.