What is the correct way to use quantized versions of Vicuna or Guanaco?
I have been trying to use quantized versions of models so that they fit on my GPU, which has at most 6 GB of VRAM, but nothing seems to work. How would I go about using 5-bit versions that stay under 6 GB of memory?
We haven't tested quantized models ourselves, but support for them is on our roadmap. Can you share your model, your code, and the error message you get?
@armsp vLLM does not support quantization at the moment. However, could you let us know the data type and quantization method you use for the models? That will definitely help inform our development decisions.
For what it's worth, https://github.com/qwopqwop200/GPTQ-for-LLaMa and https://github.com/PanQiWei/AutoGPTQ seem to be the most common mentions from folks posting quantized models on Hugging Face lately, the latter more for general use. https://github.com/SqueezeAILab/SqueezeLLM seems to do a much better job at 3-bit in particular, but I haven't seen much general use of it yet. Maybe down the road?
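For instance, loading one of the pre-quantized GPTQ checkpoints mentioned later in this thread with AutoGPTQ (outside vLLM) looks roughly like the sketch below; the exact loading flags (e.g. use_safetensors, model_basename) vary per repo, so treat them as assumptions:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Example 4-bit GPTQ checkpoint from this thread; adjust flags for the repo you use.
model_id = "TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,  # assumption: the repo ships .safetensors weights
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```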
@WoosukKwon @zhuohan123 Sure. I wasn't trying anything fancy, nor was I trying to quantize my own models. I was just trying to use the 4-bit and 5_1 quantized models that are already available, and I did that by simply changing the model path/name. I'm not quite sure how to use local models yet, but maybe that would work? For example, there is a consumer-hardware version of JosephusCheung/Guanaco
-> https://huggingface.co/JosephusCheung/GuanacoOnConsumerHardware, which is supposed to take only 5+ GB of VRAM.
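For reference, demo.py is essentially the stock quickstart script with only the model name changed (a minimal sketch reconstructed from the traceback below; the sampling parameters are placeholders):

```python
from vllm import LLM, SamplingParams

# Only the model name/path was changed from the stock example.
llm = LLM(model="JosephusCheung/GuanacoOnConsumerHardware")

# Placeholder sampling settings.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```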
The error I get for that is:

```
Downloading (…)lve/main/config.json: 100%|████████████████| 568/568 [00:00<00:00, 307kB/s]
INFO 06-23 10:59:16 llm_engine.py:59] Initializing an LLM engine with config: model='JosephusCheung/GuanacoOnConsumerHardware', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-23 10:59:16 tokenizer_utils.py:30] Using the LLaMA fast tokenizer in 'hf-internal-testing/llama-tokenizer' to avoid potential protobuf errors.
Traceback (most recent call last):
File "demo.py", line 20, in <module>
llm = LLM(model="JosephusCheung/GuanacoOnConsumerHardware") # Name or path of your model
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 55, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 151, in from_engine_args
engine = cls(*engine_configs, distributed_init_method, devices,
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 93, in __init__
worker = worker_cls(
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/worker/worker.py", line 45, in __init__
self.model = get_model(model_config)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 39, in get_model
model = model_class(model_config.hf_config)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 215, in __init__
self.model = LlamaModel(config)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 182, in __init__
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 182, in <listcomp>
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 136, in __init__
self.mlp = LlamaMLP(
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 59, in __init__
self.gate_up_proj = ColumnParallelLinear(hidden_size, 2 * intermediate_size,
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 272, in __init__
self.weight = Parameter(torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 6.00 GiB total capacity; 5.27 GiB already allocated; 0 bytes free; 5.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
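The allocator hint at the end of the error can be tried by setting PYTORCH_CUDA_ALLOC_CONF before torch initializes CUDA (a sketch; this only mitigates fragmentation and likely won't help here, since the log shows vLLM building the model in fp16, which by itself exceeds 6 GB):

```python
import os

# Assumed workaround from the error message: cap the allocator split size
# to reduce fragmentation. Must be set before torch/vLLM touch CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

from vllm import LLM  # imported only after the env var is set

llm = LLM(model="JosephusCheung/GuanacoOnConsumerHardware")
```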
I tried the quantized versions of Vicuna: https://huggingface.co/vicuna/ggml-vicuna-7b-1.1, https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized, and https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g. For other models, the errors I get are either OOM or that I need login access to the Hugging Face Hub.
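For the repos that require login access, authenticating with the Hugging Face Hub before loading should clear that particular error; a minimal sketch (the token value is a placeholder):

```python
from huggingface_hub import login

# Placeholder token; create a read-access token at https://huggingface.co/settings/tokens
login(token="hf_xxxxxxxxxxxxxxxxxxxx")

# Gated/private repos can then be downloaded as usual by vLLM or transformers.
```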
Also, support for GGUF would be nice.
Closing because quantization is now supported via GPTQ, AWQ, and SqueezeLLM.
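For anyone finding this later, a minimal sketch of how a quantized model can now be loaded (the repo name is only an example; GPTQ and SqueezeLLM checkpoints work analogously by changing the quantization argument):

```python
from vllm import LLM, SamplingParams

# Example AWQ checkpoint; substitute any compatible quantized repo.
llm = LLM(model="TheBloke/vicuna-7B-v1.5-AWQ", quantization="awq")

outputs = llm.generate(["What is quantization?"],
                       SamplingParams(temperature=0.7, max_tokens=64))
print(outputs[0].outputs[0].text)
```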