
Can't load woq int4 model

Open YangShuaiTHU opened this issue 1 year ago • 4 comments

I'm trying to evaluate an int4 quantized model with the evaluation tools, using

from intel_extension_for_transformers.llm.evaluation.lm_eval import evaluate

as is done in /examples/huggingface/pytorch/text_generation. It worked when I quantized my local Llama-13B model with

model=AutoModelForCausalLM.from_pretrained(model_name,quantization_config=woq_config,...)

and passed this quantized model to the evaluate function:

results=evaluate(... user_model=model,...)

But when I also tried to save this quantized model and load it back, I got a size mismatch error. I used

model.save_pretrained(saved_dir)
user_model=AutoModelForCausalLM.from_pretrained(saved_dir)

and it raised

RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
    size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([16435456]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).
    ...

Can anyone tell me why this issue is occurring?
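
For reference, here is the flow described above pieced together as a minimal sketch. Only the calls quoted in the issue are taken from it; the import paths for AutoModelForCausalLM and WeightOnlyQuantConfig, the quantization arguments, the output directory, and the extra evaluate() parameters are assumptions for illustration, not the exact script.

```python
# Minimal sketch of the workflow in this issue. The WeightOnlyQuantConfig import and its
# arguments, the save directory, and the extra evaluate() arguments are assumptions; only
# the calls quoted in the issue text are taken from it.
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
from intel_extension_for_transformers.llm.evaluation.lm_eval import evaluate

model_name = "huggyllama/llama-13b"       # local Llama-13B checkpoint in the original report
saved_dir = "./llama-13b-woq-int4"        # hypothetical output directory

# Assumed int4 weight-only config; valid weight_dtype values depend on the ITREX version.
woq_config = WeightOnlyQuantConfig(weight_dtype="int4")

# Step 1: quantize on load and evaluate in-process -- this path works.
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)
results = evaluate(model="hf-causal", user_model=model, tasks=["lambada_openai"])

# Step 2: save the quantized model and reload it -- this is where the
# "size mismatch ... torch.Size([16435456]) vs torch.Size([5120, 5120])" error appears.
model.save_pretrained(saved_dir)
user_model = AutoModelForCausalLM.from_pretrained(saved_dir)
```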

YangShuaiTHU avatar Jan 02 '24 08:01 YangShuaiTHU

Hi, I guess this issue may be caused by the shape of the compressed weight not matching the shape of the raw weight. For example, model.layers.0.self_attn.q_proj.weight needs 5120*5120*sizeof(float) bytes of data, but after WOQ compression we only need 16435456 bytes (containing the 4-bit weights, scales, etc.), so we create a 1-D int8 tensor to hold the compressed weight, which then fails to load because of some shape safety checks. A temporary solution is to use the _resize func to reset the shape of the compressed weight, but this may waste memory (5120*5120*sizeof(int8) - 16435456 = 9778944 bytes). We will try to find a better solution; we also welcome smart ideas proposed by the community :)
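
To make the arithmetic concrete, here is a small illustrative Python snippet (the numbers are taken from the error message in this issue; it is not project code):

```python
import torch

rows, cols = 5120, 5120                  # raw q_proj weight shape in Llama-13B
raw_fp32_bytes = rows * cols * 4         # 104,857,600 bytes for the uncompressed fp32 weight
packed_bytes = 16_435_456                # flat buffer holding 4-bit weights, scales, etc.

# The checkpoint stores the packed weight as a 1-D int8 tensor, while the freshly
# constructed model expects a [5120, 5120] tensor, so load_state_dict's shape check fails.
packed = torch.empty(packed_bytes, dtype=torch.int8)
print(packed.shape)                      # torch.Size([16435456])

# The temporary workaround described above: resize the packed buffer to rows*cols int8
# elements so the shape check passes, at the cost of an unused tail in the buffer.
wasted_bytes = rows * cols - packed_bytes
print(wasted_bytes)                      # 26,214,400 - 16,435,456 = 9,778,944 bytes
```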

zhewang1-intc avatar Jan 03 '24 06:01 zhewang1-intc

Hi @YangShuaiTHU, could you tell us which model you used? I tested loading/saving in the UT tests/CI/test_weight_only_gpu.py and it is OK. BTW, the Transformers version we used is 4.34.1.

PenghuiCheng avatar Jan 03 '24 09:01 PenghuiCheng

> Hi @YangShuaiTHU, could you tell us which model you used? I tested loading/saving in the UT tests/CI/test_weight_only_gpu.py and it is OK. BTW, the Transformers version we used is 4.34.1.

Thanks for your reply! The model is the .safetensors version from https://huggingface.co/huggyllama/llama-13b, and my Transformers version is also 4.34.1.

YangShuaiTHU avatar Jan 04 '24 02:01 YangShuaiTHU

> Hi, I guess this issue may be caused by the shape of the compressed weight not matching the shape of the raw weight. For example, model.layers.0.self_attn.q_proj.weight needs 5120*5120*sizeof(float) bytes of data, but after WOQ compression we only need 16435456 bytes (containing the 4-bit weights, scales, etc.), so we create a 1-D int8 tensor to hold the compressed weight, which then fails to load because of some shape safety checks. A temporary solution is to use the _resize func to reset the shape of the compressed weight, but this may waste memory (5120*5120*sizeof(int8) - 16435456 = 9778944 bytes). We will try to find a better solution; we also welcome smart ideas proposed by the community :)

Thank you!

YangShuaiTHU avatar Jan 04 '24 02:01 YangShuaiTHU

The issue has now been fixed.

PenghuiCheng avatar Jun 06 '24 01:06 PenghuiCheng