LLaMA-3 issues when used with vLLM
I tried these two quantization configs (identical except for the version field):
model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv_fast" }

model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv" }
Both result in the same error in vLLM:
File "/home/catid/sources/vllm/vllm/model_executor/layers/linear.py", line 558, in weight_loader
loaded_weight = loaded_weight.narrow(input_dim, start_idx,
RuntimeError: start (0) + length (14336) exceeds dimension size (8192).
(RayWorkerWrapper pid=45548) ERROR 04-20 03:14:37 worker_base.py:153] Error executing method load_model. This might cause deadlock in distributed execution.
The gemm version works fine, though.
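For reference, this is roughly how the quantized checkpoint is loaded in vLLM: a gemm checkpoint loads, while the gemv_fast checkpoint hits the weight_loader error above. The checkpoint name, tensor_parallel_size, and prompt below are assumptions, not values from this thread:

from vllm import LLM, SamplingParams

# Hypothetical path to an AWQ checkpoint produced with version "gemm".
llm = LLM(
    model="cat-llama-3-70b-q128-w4-gemm",
    quantization="awq",
    tensor_parallel_size=4,  # assumption: a 70B 4-bit model still needs several GPUs
)

outputs = llm.generate(
    ["What is AWQ quantization?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)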
GEMVFast is not implemented in vLLM yet
I'm planning a PR to implement this functionality in vLLM
https://github.com/vllm-project/vllm/pull/3289
Is there an alternative way to get continuous batching with GEMVFast? I'd really like to start generating a new, separate request while the old batch is still generating, rather than waiting for it to finish.
Currently, there is no option for it. You will have to wait until other software packages support it.
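If the underlying need is continuous batching rather than the GEMVFast kernel specifically, one workaround consistent with this thread is to serve a gemm-quantized checkpoint through vLLM, whose scheduler already merges newly arriving requests into the running batch. A minimal sketch using AsyncLLMEngine (checkpoint name, parallelism, and prompts are assumptions):

import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Hypothetical gemm-quantized checkpoint; gemv_fast checkpoints do not load (see error above).
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="cat-llama-3-70b-q128-w4-gemm",
                    quantization="awq",
                    tensor_parallel_size=4))

async def run(prompt: str) -> str:
    # Each request gets its own id; the scheduler batches it with whatever is already running.
    request_id = str(uuid.uuid4())
    final = None
    async for output in engine.generate(prompt, SamplingParams(max_tokens=128), request_id):
        final = output
    return final.outputs[0].text

async def main():
    # Two requests submitted at different times still share the same running batch.
    first = asyncio.create_task(run("Explain continuous batching."))
    await asyncio.sleep(1.0)  # the second request joins while the first is mid-generation
    second = asyncio.create_task(run("What is AWQ?"))
    print(await first)
    print(await second)

asyncio.run(main())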
@catid How much RAM do you use for that? My 31 GB fills up completely when quantizing the model.