What is the correct way to use quantized versions of Vicuna or Guanaco?
I have been trying to use quantized versions of models so that they fit on my GPU, which has at most 6 GB of VRAM, but nothing seems to work. How would I go about using 5-bit versions that stay under 6 GB of memory?
We haven't tested quantized models ourselves, but support for them is on our roadmap. Can you share your model, your code, and the error message you get?
@armsp vLLM does not support quantization at the moment. However, could you let us know the data type and quantization method you use for the models? That will definitely help inform our development decisions.
For what it's worth, https://github.com/qwopqwop200/GPTQ-for-LLaMa and https://github.com/PanQiWei/AutoGPTQ seem to be the most common mentions from folks posting quantized models on Hugging Face lately, the latter more for general use. https://github.com/SqueezeAILab/SqueezeLLM seems to do a much better job at 3-bit in particular, but I haven't seen much general use of it yet. Maybe down the road?
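For instance, loading one of the pre-quantized GPTQ checkpoints mentioned later in this thread with AutoGPTQ (outside vLLM) looks roughly like the sketch below; the exact loading flags (e.g. use_safetensors, model_basename) vary per repo, so treat them as assumptions:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Example 4-bit GPTQ checkpoint from this thread; adjust flags for the repo you use.
model_id = "TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,  # assumption: the repo ships .safetensors weights
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```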
@WoosukKwon @zhuohan123 Sure. I wasn't trying anything fancy, nor was I trying to quantize my own models. I was just trying to use the 4-bit and 5_1 quantized models that are already available, and I did that by simply changing the model path/name. I'm not quite sure how to use local models yet, but maybe that would work? For example, there is a consumer-hardware version of JosephusCheung/Guanaco
-> https://huggingface.co/JosephusCheung/GuanacoOnConsumerHardware, which is supposed to take only 5+ GB of VRAM.
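For reference, demo.py is essentially the stock quickstart script with only the model name changed (a minimal sketch reconstructed from the traceback below; the sampling parameters are placeholders):

```python
from vllm import LLM, SamplingParams

# Only the model name/path was changed from the stock example.
llm = LLM(model="JosephusCheung/GuanacoOnConsumerHardware")

# Placeholder sampling settings.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```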
The error I get for that is:

```
Downloading (…)lve/main/config.json: 100%|████████████████| 568/568 [00:00<00:00, 307kB/s]
INFO 06-23 10:59:16 llm_engine.py:59] Initializing an LLM engine with config: model='JosephusCheung/GuanacoOnConsumerHardware', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-23 10:59:16 tokenizer_utils.py:30] Using the LLaMA fast tokenizer in 'hf-internal-testing/llama-tokenizer' to avoid potential protobuf errors.
Traceback (most recent call last):
File "demo.py", line 20, in <module>
llm = LLM(model="JosephusCheung/GuanacoOnConsumerHardware") # Name or path of your model
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 55, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 151, in from_engine_args
engine = cls(*engine_configs, distributed_init_method, devices,
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 93, in __init__
worker = worker_cls(
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/worker/worker.py", line 45, in __init__
self.model = get_model(model_config)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 39, in get_model
model = model_class(model_config.hf_config)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 215, in __init__
self.model = LlamaModel(config)
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 182, in __init__
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 182, in <listcomp>
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 136, in __init__
self.mlp = LlamaMLP(
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/models/llama.py", line 59, in __init__
self.gate_up_proj = ColumnParallelLinear(hidden_size, 2 * intermediate_size,
File "/home/eli/vllm-test/vllm/lib/python3.8/site-packages/vllm/model_executor/parallel_utils/tensor_parallel/layers.py", line 272, in __init__
self.weight = Parameter(torch.empty(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 6.00 GiB total capacity; 5.27 GiB already allocated; 0 bytes free; 5.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
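The allocator hint at the end of the error can be tried by setting PYTORCH_CUDA_ALLOC_CONF before torch initializes CUDA (a sketch; this only mitigates fragmentation and likely won't help here, since the log shows vLLM building the model in fp16, which by itself exceeds 6 GB):

```python
import os

# Assumed workaround from the error message: cap the allocator split size
# to reduce fragmentation. Must be set before torch/vLLM touch CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

from vllm import LLM  # imported only after the env var is set

llm = LLM(model="JosephusCheung/GuanacoOnConsumerHardware")
```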
I tried the quantized versions of Vicuna: https://huggingface.co/vicuna/ggml-vicuna-7b-1.1, https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized, and https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g. For other models, the errors I get are either OOM or that I need login access to the Hugging Face Hub.
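For the repos that require login access, authenticating with the Hugging Face Hub before loading should clear that particular error; a minimal sketch (the token value is a placeholder):

```python
from huggingface_hub import login

# Placeholder token; create a read-access token at https://huggingface.co/settings/tokens
login(token="hf_xxxxxxxxxxxxxxxxxxxx")

# Gated/private repos can then be downloaded as usual by vLLM or transformers.
```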
Also, support for GGUF would be nice.
Closing because quantization is now supported via GPTQ, AWQ, and SqueezeLLM.
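For anyone finding this later, a minimal sketch of how a quantized model can now be loaded (the repo name is only an example; GPTQ and SqueezeLLM checkpoints work analogously by changing the quantization argument):

```python
from vllm import LLM, SamplingParams

# Example AWQ checkpoint; substitute any compatible quantized repo.
llm = LLM(model="TheBloke/vicuna-7B-v1.5-AWQ", quantization="awq")

outputs = llm.generate(["What is quantization?"],
                       SamplingParams(temperature=0.7, max_tokens=64))
print(outputs[0].outputs[0].text)
```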