Fix INC WoQ model loading issue
What does this PR do?
Since ITREX v1.4, WoQ quantized models are saved in the same format as https://huggingface.co/TheBloke/Llama-2-7B-Chat-GPTQ/tree/main: the quantization config is added to model.config when the model is saved, and this information is also used at load time. There is a subtlety when saving a WoQ quantized model:
```python
import types

from intel_extension_for_transformers.transformers.llm.quantization.utils import convert_to_quantized_model
from intel_extension_for_transformers.transformers.modeling.modeling_auto import save_low_bit

quantized_model = convert_to_quantized_model(model, quantization_config)
# At this point the quantized model's Linear layers are QuantizedLinearQBits.
quantized_model.save_pretrained = types.MethodType(save_low_bit, quantized_model)
quantized_model.save_pretrained(save_directory)  # save_directory: placeholder output path
# After saving, the Linear layers are converted to WeightOnlyLinear, because we want
# to save the checkpoint in the same format as GPTQ.
```
So if we still want to use quantizer.quantized_model afterwards, we have to load it back from the local save directory to restore it, as in the sketch below.
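A minimal sketch of restoring the quantized model from the saved checkpoint; the output path is a placeholder matching the validation command below, and using INCModelForCausalLM.from_pretrained as the loading entry point is an assumption, not an exact excerpt from this PR.

```python
from optimum.intel import INCModelForCausalLM

# Reload the WoQ checkpoint that save_pretrained wrote in the GPTQ-like format.
# "./tmp/clm_output" is a placeholder path matching the validation command below.
loaded_model = INCModelForCausalLM.from_pretrained("./tmp/clm_output")
```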
Quick validation command:

```bash
python run_clm.py --model_name_or_path EleutherAI/gpt-neo-125M --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --apply_quantization --quantization_approach weight_only --verify_loading --output_dir ./tmp/clm_output
```
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you make sure to update the documentation with your changes?
- [ ] Did you write any new necessary tests?