Quantization fails for models like EleutherAI/gpt-neox-20b and bigscience/bloom-7b1
Describe the bug
MODEL_ID="/models/models--EleutherAI--gpt-neox-20b" mkdir saved_results_gpt_neox python run_gpt-neox_int8.py --ipex-weight-only-quantization --output-dir "saved_results_gpt_neox" --jit -m ${MODEL_ID} --int8
MODEL_ID="/models/models--bigscience--bloom-7b1" mkdir saved_results_bloom python run_bloom_int8.py --ipex-weight-only-quantization --output-dir "saved_results_bloom" --jit -m ${MODEL_ID} --int8-bf16-mixed
Loading checkpoint shards: 100%|██████████| 2/2 [00:39<00:00, 19.74s/it]
Some weights of BloomForCausalLM were not initialized from the model checkpoint at /models/models--bigscience--bloom-7b1 and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Data type of the model: torch.float32
/opt/conda/envs/llm/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py:105: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  base = torch.tensor(
/opt/conda/envs/llm/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/attentions.py:143: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input_shape[1] + past_key_values_length != attention_mask.shape[1]:
/opt/conda/envs/llm/lib/python3.9/site-packages/intel_extension_for_pytorch/cpu/transformers/attentions.py:153: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_length > 1:
not implemented
not implemented
not implemented
(the line "not implemented" is printed repeatedly)
Versions
llm_feature_branch, latest self-compiled build:

git clone --branch llm_feature_branch https://github.com/intel/intel-extension-for-pytorch.git
cd intel-extension-for-pytorch
git submodule sync && git submodule update --init --recursive
export DNNL_GRAPH_BUILD_COMPILER_BACKEND=1
export CXXFLAGS="${CXXFLAGS} -D__STDC_FORMAT_MACROS"
python setup.py install
cd ../
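After building from source, a quick import check (a minimal sketch, not part of the original report) confirms the self-built wheel is the one actually in use:

import torch
import intel_extension_for_pytorch as ipex

# Both prints should succeed if the build and install completed; the IPEX
# version string indicates which build is active.
print(torch.__version__)
print(ipex.__version__)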
@RenyanDiao I tried with the same command, and the quantized model is there:
python run_bloom_int8.py --ipex-weight-only-quantization --output-dir "saved_results_bloom" --jit -m bigscience/bloom-7b1 --int8-bf16-mixed
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:09<00:00, 4.90s/it]
Some weights of BloomForCausalLM were not initialized from the model checkpoint at bigscience/bloom-7b1 and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Data type of the model: torch.float32
/home/jianan/anaconda3/envs/llm/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py:105: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
base = torch.tensor(
/home/jianan/debug/intel-extension-for-pytorch/intel_extension_for_pytorch/cpu/transformers/attentions.py:143: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if input_shape[1] + past_key_values_length != attention_mask.shape[1]:
/home/jianan/debug/intel-extension-for-pytorch/intel_extension_for_pytorch/cpu/transformers/attentions.py:153: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if seq_length > 1:
/home/jianan/debug/intel-extension-for-pytorch/intel_extension_for_pytorch/cpu/transformers/attentions.py:1755: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
torch.tensor(layer_past[3] + query_layer.shape[1], dtype=torch.long),
/home/jianan/debug/intel-extension-for-pytorch/intel_extension_for_pytorch/cpu/transformers/attentions.py:1755: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
torch.tensor(layer_past[3] + query_layer.shape[1], dtype=torch.long),
ls saved_results_bloom/
best_model.pt
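For reference, since the script was run with --jit, best_model.pt should be a TorchScript artifact; a minimal sketch of loading it back (assuming the output directory above) would be:

import torch

# Load the quantized, traced model saved by the example script.
model = torch.jit.load("saved_results_bloom/best_model.pt")
model.eval()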
Did you use the latest scripts from intel-extension-for-pytorch/examples/cpu/inference/python/llm?
cc @Xia-Weiwen
Hi @RenyanDiao. Thanks for reporting this issue. It is probably due to the GCC version and/or the hardware platform. Could you please share your GCC and CPU info? Thanks.
$ gcc --version
$ lscpu
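A full environment report would also help, e.g. via PyTorch's built-in collector:

$ python -m torch.utils.collect_env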