
Add support for quantized OPT models and refactor

Open · Zerogoki00 opened this issue 1 year ago

- Added the ability to use quantized OPT models
- Added an argument to specify the quantized model type (LLaMA by default)
- Removed --load-in-4bit because we already have --gptq-bits
- Also removed the hard-coded names for 4-bit .pt files

Tested with OPT-30B-Erebus on an RTX 4090. It runs slower than LLaMA, but it works.
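For reference, a minimal sketch of how the new arguments would be used (the model folder name here is illustrative and should match a quantized model set up in the webui's models directory; --gptq-model-type defaults to llama if omitted):

# Load a 4-bit quantized OPT model with the new arguments
> python server.py --model KoboldAI_OPT-13B-Erebus --gptq-bits 4 --gptq-model-type opt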

Zerogoki00 · Mar 13 '23 17:03

How is performance vs. FlexGen?

Ph0rk0z · Mar 13 '23 17:03

Could you briefly describe how to convert an OPT Hugging Face model to .pt (or provide a link to a pregenerated .pt)?

Would it be similar to this command documented in the GPTQ-for-LLaMa repo: python llama.py decapoda-research/llama-7b-hf c4 --wbits 4 --save llama7b-4bit.pt

Edit: Looks like the command is: python opt.py KoboldAI/OPT-13B-Erebus c4 --wbits 4 --save opt-13b-4bit.pt

Does the dataset parameter (the "c4" above) make a difference in inference? If so, which would you recommend?
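Putting the two steps together, the full workflow would presumably look something like the sketch below, assembled from the commands in this thread; the .pt filename is arbitrary, and the dataset argument (c4 here) appears to be the calibration set GPTQ uses during quantization, so it should affect quantized output quality rather than inference speed:

# Quantize an OPT model to 4-bit with GPTQ-for-LLaMa, saving the weights to a .pt file
> python opt.py KoboldAI/OPT-13B-Erebus c4 --wbits 4 --save opt-13b-4bit.pt
# Then load it in text-generation-webui with the arguments added in this PR
> python server.py --model KoboldAI_OPT-13B-Erebus --gptq-bits 4 --gptq-model-type opt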

LoopControl · Mar 13 '23 22:03

The good news is that after quantizing a 13B Erebus .pt, the model loads in around 8 GB of VRAM and generates text.

The problem is that I'm seeing 5x+ slower generation with very short contexts in 4-bit mode compared to 13B LLaMA in 4-bit:

> python server.py --gptq-bits 4 --gptq-model-type opt --no-stream --model KoboldAI_OPT-13B-Erebus
Output generated in 49.05 seconds (1.06 tokens/s, 52 tokens)
Output generated in 95.38 seconds (0.84 tokens/s, 80 tokens)

For comparison, LLaMA 13B:

> python server.py --gptq-bits 4 --no-stream --model llama-13b-hf
Output generated in 9.24 seconds (8.66 tokens/s, 80 tokens)
Output generated in 8.68 seconds (9.22 tokens/s, 80 tokens)

# 30B LLaMA model, 200-token generation:
Output generated in 36.11 seconds (5.54 tokens/s, 200 tokens)
Output generated in 42.36 seconds (4.72 tokens/s, 200 tokens)

With larger contexts (800+ tokens), the LLaMA model continues to work fine, but the OPT model seems to just hang (I gave it several minutes before killing the process).

LoopControl · Mar 14 '23 03:03

@LoopControl can you report that on https://github.com/qwopqwop200/GPTQ-for-LLaMa?

Pinging @qwopqwop200

oobabooga · Mar 14 '23 03:03

I benchmarked OPT-2.7B, but it was not as slow as this.

qwopqwop200 · Mar 14 '23 03:03