LaaZa
I think the issue is simply that the example does not use the instruct format specific to Vicuna, you possibly have different sampling parameters, and the stopping criteria is not...
The prompt should match the instruction template as you advance through question rounds. End the prompt with the assistant's turn, like `### Assistant:`, so it knows to answer as itself...
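A rough sketch of what I mean, assuming the `### Human:` / `### Assistant:` style template; the exact strings depend on the Vicuna version, so check the model card:

```python
# Sketch only: assembles a Vicuna-style multi-turn prompt.
# The template strings are assumptions, verify against the model card.
def build_prompt(history, new_question):
    prompt = ""
    for question, answer in history:
        prompt += f"### Human: {question}\n### Assistant: {answer}\n"
    # End with the assistant's turn so the model answers as itself.
    prompt += f"### Human: {new_question}\n### Assistant:"
    return prompt

prompt = build_prompt(
    history=[("What is GPTQ?", "A post-training quantization method for LLMs.")],
    new_question="Does it hurt accuracy much?",
)
```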
If you only have 4 GB of VRAM you are never going to be able to load a 13B model onto the GPU. Look into trying GGML models.
You don't. It depends on your GPU. But you can try GGML models since they run on the CPU and use system RAM. It's not going to be fast, though.
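If you want to try GGML outside the webui, here is a minimal sketch with llama-cpp-python; the file path and thread count are placeholders:

```python
# Sketch: run a GGML checkpoint on the CPU with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/vicuna-13b.ggmlv3.q4_0.bin",  # placeholder GGML file
    n_ctx=2048,   # context length
    n_threads=8,  # CPU threads to use
)

output = llm("### Human: Hello!\n### Assistant:", max_tokens=128)
print(output["choices"][0]["text"])
```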
A 13B LLaMA model will not fit in 24 GB of VRAM. You need to either load it in 8-bit with `load_in_8bit` or use a GPTQ-quantized model, which is usually 4-bit.
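For the 8-bit route, a minimal sketch with the Hugging Face transformers + bitsandbytes stack; the model id is a placeholder:

```python
# Sketch: load a 13B model in 8-bit so it fits in 24 GB of VRAM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-13b"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,   # quantize weights to 8-bit on load (needs bitsandbytes)
    device_map="auto",   # place layers on the GPU automatically
)
```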
Just set the `--load-in-8bit` flag or check that option in the webui when you load the model. For GPTQ quantization you could use [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) or [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa). Currently textgen uses the latter...
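With AutoGPTQ, loading an already-quantized model looks roughly like this; the repo name and checkpoint settings are assumptions, so check the model's README:

```python
# Sketch: load a pre-quantized 4-bit GPTQ checkpoint with AutoGPTQ.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "TheBloke/vicuna-13B-GPTQ"  # placeholder repo name

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_safetensors=True,  # depends on how the checkpoint was saved
)
```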
Setting gpu-memory will offload any excess to RAM. It may make inference much slower. Honestly, the very tiny degradation from quantization is usually well worth the tradeoff.
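Roughly what the gpu-memory setting does under the hood is cap VRAM per device and let accelerate spill the remaining layers to CPU RAM; the numbers here are placeholders:

```python
# Sketch: cap GPU 0 and let overflow layers land in system RAM.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",                   # placeholder model id
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "48GiB"},  # layers beyond the cap go to RAM
)
```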
Honestly though, if we take into account the immense reduction in memory requirements, the differences in perplexity scores are insignificantly small. This is a comparison table from llama.cpp (different from GPTQ...
No, the quantization does a lot of smart things to minimize the negative impact. Every format differs in bit allocation (what the bits are used for), so we can't just "chop it...
> @LaaZa wow yeah, those differences are pretty minor. I'm not very familiar with perplexity though. Is it able to reflect model accuracy well?

It measures how well the model...
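For context, perplexity is just the exponentiated average negative log-likelihood the model assigns to each token of a held-out text, so lower means the model predicts the text better:

$$
\mathrm{PPL}(x_1,\dots,x_N) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\!\left(x_i \mid x_{<i}\right)\right)
$$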