
Quantization support

Open generalsvr opened this issue 1 year ago • 8 comments

How do I use 8-bit quantized models? Can I run GGML/GGUF models?

generalsvr avatar Oct 16 '23 06:10 generalsvr

8-bit weight-only quantization is currently only supported for LLaMA models.

hiworldwzj avatar Oct 16 '23 09:10 hiworldwzj

Any examples?

generalsvr avatar Oct 16 '23 09:10 generalsvr

parser.add_argument("--mode", type=str, default=[], nargs='+',
                    help="Model mode: [int8kv] [int8weight | int4weight]")

hiworldwzj avatar Oct 17 '23 06:10 hiworldwzj

As for the model file format, we have not tested GGML/GGUF so far. What is the motivation for using these formats?

XHPlus avatar Oct 19 '23 01:10 XHPlus

Will GPTQ be supported?

JustinLin610 avatar Oct 19 '23 16:10 JustinLin610

@XHPlus There are a lot of open-source models on Hugging Face driven by https://huggingface.co/TheBloke. Many people in the open-source community use those quantized models with TGI / vLLM.

suhjohn avatar Nov 14 '23 19:11 suhjohn

parser.add_argument("--mode", type=str, default=[], nargs='+',
                    help="Model mode: [int8kv] [int8weight | int4weight]")

Using this option with Llama2-13B gives this error:

_get_exception_class.<locals>.Derived: 'LlamaTransformerLayerWeightQuantized' object has no attribute 'quantize_weight'

I tried both --mode int8kv int4weight and --mode int8kv int4weight

Any suggestions on how to fix this?

adi avatar Feb 08 '24 18:02 adi

@XHPlus Quantization is often the only practical way to run bigger models on smaller GPUs, e.g. Mixtral. With vLLM, I can run Mixtral quantized with 48 GB of VRAM; the unquantized model would use up to 100 GB of VRAM, I guess.
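
For a rough sense of the numbers, a back-of-the-envelope estimate (assuming ~46.7B total parameters for Mixtral 8x7B and ignoring KV cache, activations, and runtime overhead):

# Approximate VRAM needed just for Mixtral-8x7B weights at different precisions.
# ~46.7B total parameters is the commonly cited figure; real usage adds overhead.
PARAMS = 46.7e9

for name, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.0f} GiB of weights")

# fp16: ~87 GiB, int8: ~43 GiB, int4: ~22 GiB -- consistent with a 4-bit
# quantized Mixtral fitting in 48 GB of VRAM while fp16 needs ~100 GB overall.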

VfBfoerst avatar Mar 07 '24 10:03 VfBfoerst