
Enhance 3.x torch WOQ load

Open yuwenzho opened this pull request 1 year ago • 3 comments

Type of Change

feature
API changed or not: no

Description

Use a different WeightOnlyLinear module depending on the target device.

  • Abstract the WeightOnlyLinear class, with concrete subclasses INCWeightOnlyLinear and HPUWeightOnlyLinear (see the sketch after this list)
  • Load WOQ linear weights module by module
  • Save HPU-format tensors so they can be reused on later loads: the huggingface format saves to a local 'hpu_model.safetensors' file; the default format saves to a 'quantized_hpu_weight.pt' file
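
A minimal sketch of this layout (illustrative only; the dispatch helper pick_woq_linear_cls is hypothetical and not part of the PR):

import torch

class WeightOnlyLinear(torch.nn.Module):
    """Abstract base: shared interface and metadata for weight-only quantized linear layers."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features

class INCWeightOnlyLinear(WeightOnlyLinear):
    """Packing produced by the INC WOQ algorithms (RTN, GPTQ, ...) on non-HPU devices such as CPU."""

class HPUWeightOnlyLinear(WeightOnlyLinear):
    """HPU-oriented packing used when loading with device="hpu"."""

def pick_woq_linear_cls(device: str) -> type:
    # Hypothetical helper: choose the concrete module class by target device.
    return HPUWeightOnlyLinear if device == "hpu" else INCWeightOnlyLinear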

Example: load a huggingface-format WOQ model:

from neural_compressor.torch.quantization import load

model_id = "TheBloke/TinyLlama-1.1B-python-v0.1-GPTQ"
# first load: torch.nn.Linear -> INCWeightOnlyLinear -> HPUWeightOnlyLinear, 
# and then save hpu_model.safetensors to local cache dir
qmodel = load(model_name_or_path=model_id, format="huggingface", device="hpu")

# second load: torch.nn.Linear -> HPUWeightOnlyLinear using hpu_model.safetensors saved in local cache dir
qmodel = load(model_name_or_path=model_id, format="huggingface", device="hpu")
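
The two loads above differ only in whether the cached HPU weights already exist. A minimal sketch of that check (illustrative only, not the PR's actual code; cache_dir stands in for the model's local cache directory):

import os

def hpu_weight_load_path(cache_dir: str) -> str:
    """Report which conversion path a load would take, based on the cached HPU weights."""
    hpu_weights = os.path.join(cache_dir, "hpu_model.safetensors")
    if os.path.exists(hpu_weights):
        # second load: torch.nn.Linear -> HPUWeightOnlyLinear straight from the cached file
        return "reuse " + hpu_weights
    # first load: torch.nn.Linear -> INCWeightOnlyLinear -> HPUWeightOnlyLinear, then cache
    return "convert and save " + hpu_weights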

Example: load a default (INC) format WOQ model:

from neural_compressor.torch.quantization import load

# `fp32_model` is the original FP32 model whose quantized weights were saved to the 'saved_results' dir
# first load: torch.nn.Linear -> INCWeightOnlyLinear -> HPUWeightOnlyLinear,
# and then save quantized_hpu_weight.pt to the 'saved_results' dir
qmodel = load("saved_results", original_model=fp32_model, device="hpu")

# second load: torch.nn.Linear -> HPUWeightOnlyLinear using quantized_hpu_weight.pt saved in 'saved_results' dir
qmodel = load("saved_results", original_model=fp32_model, device="hpu")

How has this PR been tested?

CI

Dependency Change?

No

yuwenzho avatar Jun 18 '24 08:06 yuwenzho

⛈️ Required checks status: Has failure 🔴

Warning: If you do not have access to re-run the Probot, please contact XuehaoSun for help. If you push a new commit, all of the workflows will be re-triggered.

Groups summary

🟢 Code Scan Tests workflow
Check ID Status Error details
Code-Scan success
Code-Scan (Bandit Code Scan Bandit) success
Code-Scan (DocStyle Code Scan DocStyle) success
Code-Scan (Pylint Code Scan Pylint) success

These checks are required after the changes to neural_compressor/torch/algorithms/weight_only/gptq.py, neural_compressor/torch/algorithms/weight_only/modules.py, neural_compressor/torch/algorithms/weight_only/rtn.py, neural_compressor/torch/algorithms/weight_only/save_load.py, neural_compressor/torch/quantization/load_entry.py, neural_compressor/torch/utils/environ.py, neural_compressor/torch/utils/utility.py.

🟢 Model Tests 3x workflow
Check ID Status Error details
Model-Test-3x success
Model-Test-3x (Generate Report GenerateReport) success
Model-Test-3x (Run PyTorch Model opt_125m_woq_gptq_int4) success
Model-Test-3x (Run PyTorch Model opt_125m_woq_gptq_int4_dq_bnb) success
Model-Test-3x (Run PyTorch Model opt_125m_woq_gptq_int4_dq_ggml) success

These checks are required after the changes to neural_compressor/torch/algorithms/weight_only/gptq.py, neural_compressor/torch/algorithms/weight_only/modules.py, neural_compressor/torch/algorithms/weight_only/rtn.py, neural_compressor/torch/algorithms/weight_only/save_load.py, neural_compressor/torch/quantization/load_entry.py, neural_compressor/torch/utils/environ.py, neural_compressor/torch/utils/utility.py.

🔴 Unit Tests 3x-PyTorch workflow
Check ID Status Error details
UT-3x-Torch failure
UT-3x-Torch (Coverage Compare CollectDatafiles) failure download
UT-3x-Torch (Unit Test 3x Torch Unit Test 3x Torch) success
UT-3x-Torch (Unit Test 3x Torch baseline Unit Test 3x Torch baseline) success

These checks are required after the changes to neural_compressor/torch/algorithms/weight_only/gptq.py, neural_compressor/torch/algorithms/weight_only/modules.py, neural_compressor/torch/algorithms/weight_only/rtn.py, neural_compressor/torch/algorithms/weight_only/save_load.py, neural_compressor/torch/quantization/load_entry.py, neural_compressor/torch/utils/environ.py, neural_compressor/torch/utils/utility.py, test/3x/torch/quantization/weight_only/test_autoround.py, test/3x/torch/quantization/weight_only/test_awq.py, test/3x/torch/quantization/weight_only/test_gptq.py, test/3x/torch/quantization/weight_only/test_load.py, test/3x/torch/quantization/weight_only/test_load_woq_hf_model.py, test/3x/torch/quantization/weight_only/test_rtn.py.


Thank you for your contribution! 💜

Note: This comment is automatically generated and will be updated every 180 seconds within the next 6 hours. If you have any other questions, contact chensuyue or XuehaoSun for help.

github-actions[bot] avatar Jun 18 '24 08:06 github-actions[bot]

Abstract WeightOnlyLinear class. Inherited class INCWeightOnlyLinear and HPUWeightOnlyLinear

For CPU, how does the WOQ algorithm use the abstract class WeightOnlyLinear? Do we use INCWeightOnlyLinear instead of WeightOnlyLinear?

Kaihui-intel avatar Jun 18 '24 08:06 Kaihui-intel

Abstract WeightOnlyLinear class. Inherited class INCWeightOnlyLinear and HPUWeightOnlyLinear

For CPU, how does the WOQ algorithm use the abstract class WeightOnlyLinear? Do we use INCWeightOnlyLinear instead of WeightOnlyLinear?

Yes, the algorithm should use INCWeightOnlyLinear. Fixed in https://github.com/intel/neural-compressor/pull/1877/commits/56c864f58cee53be0a79e816e5686bbe1fffbce1
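
For context, a hedged sketch of the CPU path discussed here, assuming the existing 3.x prepare/convert flow with RTNConfig (the printed class name reflects this PR's renaming; it is not output captured from the PR itself):

import torch
from neural_compressor.torch.quantization import RTNConfig, convert, prepare

# Quantize a toy model on CPU: after this PR the produced module is INCWeightOnlyLinear
# (previously WeightOnlyLinear). HPUWeightOnlyLinear only appears when loading with device="hpu".
fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
model = prepare(fp32_model, RTNConfig())
model = convert(model)
print(type(model[0]).__name__)  # expected: INCWeightOnlyLinear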

yuwenzho avatar Jun 20 '24 08:06 yuwenzho