Fix INC WoQ model loading issue
What does this PR do?
Since ITREX v1.4, WoQ quantized models are saved in the same format as https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ/tree/main: the quantization config is added to model.config when the model is saved, and this information is also used at load time. There is a subtlety when saving a WoQ quantized model:
```python
import types

from intel_extension_for_transformers.transformers.llm.quantization.utils import convert_to_quantized_model
from intel_extension_for_transformers.transformers.modeling.modeling_auto import save_low_bit

quantized_model = convert_to_quantized_model(model, quantization_config)
# At this point the quantized model's Linear layers are QuantizedLinearQBits.
quantized_model.save_pretrained = types.MethodType(save_low_bit, quantized_model)
quantized_model.save_pretrained(save_directory)  # save_directory: placeholder output path
# After saving, the Linear layers are converted to WeightOnlyLinear, because we want
# to save the checkpoint in the same format as GPTQ.
```
So if we still want to use quantizer.quantized_model afterwards, we have to load it back from the local save directory to restore it, as in the sketch below.
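A minimal sketch of restoring the quantized model from the saved checkpoint; the output path is a placeholder matching the validation command below, and using INCModelForCausalLM.from_pretrained as the loading entry point is an assumption, not an exact excerpt from this PR.

```python
from optimum.intel import INCModelForCausalLM

# Reload the WoQ checkpoint that save_pretrained wrote in the GPTQ-like format.
# "./tmp/clm_output" is a placeholder path matching the validation command below.
loaded_model = INCModelForCausalLM.from_pretrained("./tmp/clm_output")
```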
Quick validation command:

```bash
python run_clm.py --model_name_or_path EleutherAI/gpt-neo-125M --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --apply_quantization --quantization_approach weight_only --verify_loading --output_dir ./tmp/clm_output
```
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you make sure to update the documentation with your changes?
- [ ] Did you write any new necessary tests?