LLaMA-3 issues when used with vLLM
I tried these two quantization configs (identical except for the version field):
model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv_fast" }

model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv" }
Both result in the same error in vLLM:
File "/home/catid/sources/vllm/vllm/model_executor/layers/linear.py", line 558, in weight_loader
loaded_weight = loaded_weight.narrow(input_dim, start_idx,
RuntimeError: start (0) + length (14336) exceeds dimension size (8192).
(RayWorkerWrapper pid=45548) ERROR 04-20 03:14:37 worker_base.py:153] Error executing method load_model. This might cause deadlock in distributed execution.
The gemm version works fine, though.
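For reference, this is roughly how the quantized checkpoint is loaded in vLLM: a gemm checkpoint loads, while the gemv_fast checkpoint hits the weight_loader error above. The checkpoint name, tensor_parallel_size, and prompt below are assumptions, not values from this thread:

from vllm import LLM, SamplingParams

# Hypothetical path to an AWQ checkpoint produced with version "gemm".
llm = LLM(
    model="cat-llama-3-70b-q128-w4-gemm",
    quantization="awq",
    tensor_parallel_size=4,  # assumption: a 70B 4-bit model still needs several GPUs
)

outputs = llm.generate(
    ["What is AWQ quantization?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].outputs[0].text)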
GEMVFast is not implemented in vLLM yet
I'm planning a PR to implement this functionality in vLLM
https://github.com/vllm-project/vllm/pull/3289
Is there an alternative way to get continuous batching with GEMVFast? I'd really like to start generating a new, separate request while the old batch is still generating, rather than waiting for it to finish.
Currently, there is no option for it. You will have to wait until other software packages support it.
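If the underlying need is continuous batching rather than the GEMVFast kernel specifically, one workaround consistent with this thread is to serve a gemm-quantized checkpoint through vLLM, whose scheduler already merges newly arriving requests into the running batch. A minimal sketch using AsyncLLMEngine (checkpoint name, parallelism, and prompts are assumptions):

import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Hypothetical gemm-quantized checkpoint; gemv_fast checkpoints do not load (see error above).
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="cat-llama-3-70b-q128-w4-gemm",
                    quantization="awq",
                    tensor_parallel_size=4))

async def run(prompt: str) -> str:
    # Each request gets its own id; the scheduler batches it with whatever is already running.
    request_id = str(uuid.uuid4())
    final = None
    async for output in engine.generate(prompt, SamplingParams(max_tokens=128), request_id):
        final = output
    return final.outputs[0].text

async def main():
    # Two requests submitted at different times still share the same running batch.
    first = asyncio.create_task(run("Explain continuous batching."))
    await asyncio.sleep(1.0)  # the second request joins while the first is mid-generation
    second = asyncio.create_task(run("What is AWQ?"))
    print(await first)
    print(await second)

asyncio.run(main())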
@catid How much RAM do you use for that? My 31 GB fills up completely when quantizing the model.