text-generation-webui
Add support for quantized OPT models and refactor
- Added the ability to use quantized OPT models.
- Added an argument to specify the quantized model type (LLaMA by default).
- Removed --load-in-4bit, since we already have --gptq-bits.
- Also removed the hard-coded names for 4-bit .pt files.
Tested with OPT-30B-Erebus on an RTX 4090. It runs slower than LLaMA, but it works.
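For context, the dispatch on the new argument could look roughly like the sketch below; the helper names and paths here are illustrative placeholders, not the exact code in this PR:

```python
# Illustrative sketch only: dispatch on --gptq-model-type when loading a
# GPTQ-quantized checkpoint. Helper names and paths are placeholders.
from pathlib import Path


def load_quantized(model_name, gptq_bits, gptq_model_type='llama'):
    model_type = gptq_model_type.lower()
    if model_type == 'llama':
        from llama import load_quant  # from GPTQ-for-LLaMa
    elif model_type == 'opt':
        from opt import load_quant    # from GPTQ-for-LLaMa
    else:
        raise ValueError(f'Unsupported --gptq-model-type: {gptq_model_type}')

    # No hard-coded .pt names: derive the checkpoint path from the model name.
    checkpoint = Path('models') / f'{model_name}-{gptq_bits}bit.pt'
    model = load_quant(str(Path('models') / model_name), str(checkpoint), gptq_bits)
    return model.to('cuda')
```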
How is the performance vs FlexGen?
Could you briefly describe how to convert an OPT Hugging Face model to .pt (or provide a link to a pregenerated .pt)?
Would it be similar to this command documented in the GPTQ-for-LLaMa repo:
python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --save llama7b-4bit.pt
Edit: Looks like the command is:
python opt.py KoboldAI/OPT-13B-Erebus c4 --wbits 4 --save opt-13b-4bit.pt
Does the dataset parameter (the "c4" above) make a difference at inference time? If so, which one would you recommend?
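As a side note, a rough sketch of loading the resulting checkpoint outside the webui might look like this, assuming GPTQ-for-LLaMa's opt.py is importable and that its load_quant(model, checkpoint, wbits) signature matches the version used for quantizing (the paths below are just examples):

```python
# Sketch: load a 4-bit OPT checkpoint produced by GPTQ-for-LLaMa's opt.py.
# Paths and the load_quant signature are assumptions based on that repo.
from transformers import AutoTokenizer
from opt import load_quant  # opt.py from GPTQ-for-LLaMa

model = load_quant('KoboldAI/OPT-13B-Erebus', 'opt-13b-4bit.pt', 4).to('cuda')
tokenizer = AutoTokenizer.from_pretrained('KoboldAI/OPT-13B-Erebus')
```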
Good news is, after quantizing a 13B Erebus .pt, the model loads in around 8GB of VRAM and seems to generate text.
Problem is, I'm seeing 5x+ slower generation with very short contexts in 4-bit mode compared to 13B LLaMA in 4-bit:
> python server.py --gptq-bits 4 --gptq-model-type opt --no-stream --model KoboldAI_OPT-13B-Erebus
Output generated in 49.05 seconds (1.06 tokens/s, 52 tokens)
Output generated in 95.38 seconds (0.84 tokens/s, 80 tokens)
For comparison, LLaMA 13B:
> python server.py --gptq-bits 4 --no-stream --model llama-13b-hf
Output generated in 9.24 seconds (8.66 tokens/s, 80 tokens)
Output generated in 8.68 seconds (9.22 tokens/s, 80 tokens)
# 30B LLaMA model, 200-token generation:
Output generated in 36.11 seconds (5.54 tokens/s, 200 tokens)
Output generated in 42.36 seconds (4.72 tokens/s, 200 tokens)
With larger contexts (800+ tokens), the LLaMA model continues to work fine, but the OPT model seems to just hang (I gave it multiple minutes before killing the process).
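For anyone trying to reproduce these numbers outside the web UI, a minimal timing sketch along these lines should be close enough; this is only an assumed standalone equivalent of the webui's "Output generated in ..." line, using whatever model and tokenizer you loaded earlier:

```python
# Rough tokens/s measurement around Hugging Face generate(); mirrors the
# "Output generated in X seconds" lines the webui prints. Assumes `model`
# and `tokenizer` were loaded as in the earlier sketch.
import time
import torch


def benchmark(model, tokenizer, prompt, max_new_tokens=80):
    inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
    start = time.time()
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.time() - start
    new_tokens = output.shape[-1] - inputs['input_ids'].shape[-1]
    print(f'Output generated in {elapsed:.2f} seconds '
          f'({new_tokens / elapsed:.2f} tokens/s, {new_tokens} tokens)')
```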
@LoopControl can you report that on https://github.com/qwopqwop200/GPTQ-for-LLaMa?
Pinging @qwopqwop200
I benchmarked OPT-2.7B, but it was not as slow as this.