D+S packing in vLLM seems buggy
Hello!
I followed the D+S packing instructions and stored the packed .pt file in "~/models/${model_name}-squeezellm/packed_weight", where model_name="Llama-2-7b-chat-hf". When I load this model in vLLM:
```bash
python examples/llm_engine_example.py --dtype float16 --model ~/models/${model_name}-squeezellm/packed_weight --quantization squeezellm
```
vLLM complains that it cannot find the parameters "sparse_threshold.model.layers.*". Any idea why? I repeated the quantization from scratch several times, but every run ended with the same error.
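To double-check that the packing step actually produced these tensors, the packed checkpoint can be inspected directly. A minimal sketch (the directory is from my setup above; the exact .pt filename may differ, so I glob for it):

```python
import glob
import os

import torch

# Directory holding the packed .pt checkpoint (path from my setup above).
packed_dir = os.path.expanduser(
    "~/models/Llama-2-7b-chat-hf-squeezellm/packed_weight"
)

for ckpt in sorted(glob.glob(os.path.join(packed_dir, "*.pt"))):
    state_dict = torch.load(ckpt, map_location="cpu")
    hits = [k for k in state_dict if "sparse_threshold" in k]
    print(f"{ckpt}: {len(hits)} 'sparse_threshold' keys")
    for k in hits[:5]:  # show a few example key names
        print("  ", k)
```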
As a quick fix, I manually skipped this error in vLLM's model-loading step in llama.py whenever the missing parameter cannot be found (see the sketch below). However, the model then fails to generate meaningful output, so I believe those parameters are indeed not being loaded correctly.
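For reference, my workaround is roughly the guard below inside load_weights in vllm/model_executor/models/llama.py. This is a sketch of the idea, not the exact vLLM code; the surrounding loop structure (stacked-parameter handling, weight iterators) differs between vLLM versions:

```python
# Sketch of the quick fix inside LlamaForCausalLM.load_weights
# (vllm/model_executor/models/llama.py).
# default_weight_loader is vLLM's fallback loader, importable from
# vllm.model_executor.weight_utils in the version I am using.
params_dict = dict(self.named_parameters())
for name, loaded_weight in weights:
    if name not in params_dict:
        # Quick fix: silently skip checkpoint tensors that vLLM cannot
        # match, e.g. the "sparse_threshold.model.layers.*" entries.
        # The lookup error goes away, but the model then produces
        # garbage, which suggests these tensors are genuinely required.
        continue
    param = params_dict[name]
    weight_loader = getattr(param, "weight_loader", default_weight_loader)
    weight_loader(param, loaded_weight)
```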