D+S packing in vLLM seems buggy
Hello!
I followed the D+S packing instructions and stored the packed .pt file in "~/models/${model_name}-squeezellm/packed_weight", where model_name="Llama-2-7b-chat-hf". When I load this model in vLLM:
```bash
python examples/llm_engine_example.py --dtype float16 --model ~/models/${model_name}-squeezellm/packed_weight --quantization squeezellm
```
vLLM complains that it cannot find the parameters "sparse_threshold.model.layers.*". Any idea why? I repeated the quantization from scratch several times, but every run ended with the same error.
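To double-check that the packing step actually produced these tensors, the packed checkpoint can be inspected directly. A minimal sketch (the directory is from my setup above; the exact .pt filename may differ, so I glob for it):

```python
import glob
import os

import torch

# Directory holding the packed .pt checkpoint (path from my setup above).
packed_dir = os.path.expanduser(
    "~/models/Llama-2-7b-chat-hf-squeezellm/packed_weight"
)

for ckpt in sorted(glob.glob(os.path.join(packed_dir, "*.pt"))):
    state_dict = torch.load(ckpt, map_location="cpu")
    hits = [k for k in state_dict if "sparse_threshold" in k]
    print(f"{ckpt}: {len(hits)} 'sparse_threshold' keys")
    for k in hits[:5]:  # show a few example key names
        print("  ", k)
```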
As a quick fix, I manually skipped this error in vLLM's model-loading step in llama.py whenever the missing parameter cannot be found (see the sketch below). However, the model then fails to generate meaningful output, so I believe those parameters are indeed not being loaded correctly.
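For reference, my workaround is roughly the guard below inside load_weights in vllm/model_executor/models/llama.py. This is a sketch of the idea, not the exact vLLM code; the surrounding loop structure (stacked-parameter handling, weight iterators) differs between vLLM versions:

```python
# Sketch of the quick fix inside LlamaForCausalLM.load_weights
# (vllm/model_executor/models/llama.py).
# default_weight_loader is vLLM's fallback loader, importable from
# vllm.model_executor.weight_utils in the version I am using.
params_dict = dict(self.named_parameters())
for name, loaded_weight in weights:
    if name not in params_dict:
        # Quick fix: silently skip checkpoint tensors that vLLM cannot
        # match, e.g. the "sparse_threshold.model.layers.*" entries.
        # The lookup error goes away, but the model then produces
        # garbage, which suggests these tensors are genuinely required.
        continue
    param = params_dict[name]
    weight_loader = getattr(param, "weight_loader", default_weight_loader)
    weight_loader(param, loaded_weight)
```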