@mazzzystar Not sure on that, but it definitely seems wrong. My `.bin` file is 26M for a 13B model, so I'd expect the 7B to be a few...
I'd love to see some separation, or even the option to not run the model with this repo at all and instead just use the SvelteKit app + Mongo with an API...
:+1: Would love to see support for replit 1.3B
@maozdemir Could you look into merging this PR, please? I'd love to try privateGPT, but I want to use it with text-generation-webui instead of llama.cpp.
+1 for LoRA support (ideally 4-bit LoRA support please, i.e. the output from https://github.com/johnsmith0031/alpaca_lora_4bit !)
Yep, 4-bit inference with bitsandbytes is super slow. GPTQ is pretty fast though; on my hardware it's actually faster than fp16 inference. There's a high-level library called AutoGPTQ (https://github.com/PanQiWei/AutoGPTQ)...
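For anyone who wants to try it, here's a minimal sketch of loading a pre-quantized GPTQ checkpoint with AutoGPTQ; the checkpoint path and generation settings are placeholders, not something from this thread:

```python
# Minimal sketch: 4-bit GPTQ inference via AutoGPTQ.
# The checkpoint path is a placeholder; substitute any pre-quantized GPTQ model.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "path/to/13B-GPTQ-4bit"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,  # most published GPTQ checkpoints ship safetensors
    use_triton=False,      # plain CUDA kernels; the Triton backend is optional
)

inputs = tokenizer("Explain GPTQ in one sentence.", return_tensors="pt").to("cuda:0")
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```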
`min_tokens` is an essential missing feature IMO. Would be great to get this merged
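For context, `min_tokens` usually just means masking the EOS logit until a minimum number of new tokens has been generated. A rough illustrative sketch using a HF-style logits processor (not the API of this project, names made up for the example):

```python
# Illustrative sketch of a `min_tokens` constraint: forbid EOS until at least
# `min_tokens` new tokens have been generated.
import torch
from transformers import LogitsProcessor

class MinTokensLogitsProcessor(LogitsProcessor):
    def __init__(self, prompt_len: int, min_tokens: int, eos_token_id: int):
        self.prompt_len = prompt_len
        self.min_tokens = min_tokens
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        generated = input_ids.shape[-1] - self.prompt_len
        if generated < self.min_tokens:
            scores[:, self.eos_token_id] = float("-inf")  # can't end the sequence yet
        return scores
```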
Yes please, this would be great. It would allow loading up to 30B models on a consumer GPU with 24GB of VRAM. Also, instead of using GPTQ-for-LLaMa, please use AutoGPTQ ( https://github.com/PanQiWei/AutoGPTQ...
Same here. I'm seeing 20+ tok/s on a 13B model with GPTQ-for-LLaMa/AutoGPTQ and 3-4 tok/s with exllama on my P40. There is a flag for gptq/torch called `use_cuda_fp16 = False`...
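In case it helps, a sketch of where that flag goes when loading through AutoGPTQ (the path is a placeholder and I'm assuming `from_quantized` is the loader being used):

```python
# Sketch: disable fp16 CUDA matmuls for cards with weak fp16 (e.g. the P40).
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/13B-GPTQ-4bit",  # placeholder checkpoint
    device="cuda:0",
    use_triton=False,
    use_cuda_fp16=False,  # fall back to fp32 kernels; helps on Pascal cards
)
```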
@TimyIsCool As mentioned above by @turboderp, the poor fp16 performance on the P40 means exllama is going to be slow. Try the AutoGPTQ/GPTQ-for-LLaMa loaders instead.