Llama-2-7b-chat-hf produces wrong output on CPU
Details: https://github.com/analytics-zoo/nano/issues/1246#issuecomment-2046881777
This problem happens with transformers versions newer than 4.36.0.
The problem can be solved either by setting optimize_model=False or by using transformers==4.34.0.
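A minimal sketch of the optimize_model=False workaround, assuming the loading flow of the llama2 CPU example linked below; the model path is a placeholder:

```python
# Sketch of the optimize_model=False workaround (assumed to follow the usual
# ipex-llm llama2 CPU example flow; the model path is a placeholder).
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder

# optimize_model=False skips ipex-llm's attention rewrite, avoiding the wrong output.
# Alternatively, keep optimize_model=True and downgrade to transformers==4.34.0.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=False,
                                             trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```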
I guess the problem might be here: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/src/ipex_llm/transformers/convert.py#L846
CC @glorysdj @Oscilloscope98
After some investigation, the cause of the problem is somewhat weird.
It seems that the implementation of native_sdp generates the wrong output.
However, this native_sdp is roughly the same as transformers' native_sdp function, which implies that using transformers' LlamaAttention forward function would also generate the wrong output. After some verification, this turned out to be true.
-------------------- Prompt --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
-------------------- Output --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
I'm just an AI, I don't have personal preferences or feelings, but I'm here to help you with any questions or
Currently, I have found that we can get the correct output by using torch.nn.functional.scaled_dot_product_attention to calculate the attention.
-------------------- Prompt --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
-------------------- Output --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
Ah, a great question! AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence
However, this function cannot return attn_weights. Most of the time, users will not need attn_weights, so we can temporarily fix this problem by using torch.nn.functional.scaled_dot_product_attention whenever the user does not require attn_weights, until we find a better fix.
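A minimal sketch of this fallback, assuming a simplified attention helper: the native_sdp below is a stand-in for ipex-llm's own helper rather than the actual implementation, and a [bsz, heads, seq_len, head_dim] tensor layout is assumed.

```python
import math
import torch
import torch.nn.functional as F

def native_sdp(query, key, value, attention_mask):
    # Plain softmax attention, roughly what the eager/native path computes.
    attn_weights = torch.matmul(query, key.transpose(2, 3)) / math.sqrt(query.size(-1))
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    return torch.matmul(attn_weights, value), attn_weights

def attention_forward(query, key, value, attention_mask, output_attentions):
    if output_attentions:
        # attn_weights are requested: keep native_sdp, which can return them.
        return native_sdp(query, key, value, attention_mask)
    # attn_weights are not needed: let SDPA handle the causal masking itself,
    # which stays correct even when attention_mask is None.
    attn_output = F.scaled_dot_product_attention(
        query, key, value,
        attn_mask=attention_mask,
        is_causal=attention_mask is None and query.size(2) > 1,
    )
    return attn_output, None
```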
It seems that under the same conditions, v4.38.0 is correct. I guess something is missing in our native_sdp.
Update: the attention_mask is always None during the calculation. I guess this is not the expected behavior.
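One way to confirm this observation is a forward pre-hook (a debugging sketch, not from the issue; it assumes torch >= 2.0 for with_kwargs hooks, that the llama module names ending in self_attn are preserved after optimization, that attention_mask reaches the attention modules as a keyword argument, and that model is the already-loaded model from the example script):

```python
# Debugging sketch: log whether attention_mask is None each time an attention
# layer is called. Assumes torch >= 2.0 (with_kwargs=True), that the *.self_attn
# module names survive optimization, and that attention_mask arrives as a kwarg.
def log_attention_mask(module, args, kwargs):
    mask = kwargs.get("attention_mask")
    shape = "None" if mask is None else tuple(mask.shape)
    print(f"{module.__class__.__name__}: attention_mask = {shape}")

for name, module in model.named_modules():
    if name.endswith("self_attn"):
        module.register_forward_pre_hook(log_attention_mask, with_kwargs=True)
```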
I could not reproduce this issue in my CPU environment. The result is reasonable, and it is the same whether I set optimize_model=False or True. Code: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2/generate.py
My environment:
ipex-llm 2.1.0b20240409
transformers 4.36.2
torch 2.1.0+cpu
torchaudio 2.1.0+cpu
torchvision 0.16.0+cpu
My model version is from Jul 19, 2023.
The result of optimize_model=True:
-------------------- Prompt --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
-------------------- Output --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
Ah, a great question! AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence
The result of optimize_model=False:
-------------------- Prompt --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
-------------------- Output --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
Ah, a great question! AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence
After I updated torch to 2.1.2, I reproduced this issue.
torch 2.1.2+cpu
torchaudio 2.1.2+cpu
torchvision 0.16.2+cpu
But on my side, optimize_model=False produces the correct answer, while optimize_model=True produces the wrong answer.
optimize_model=True:
-------------------- Prompt --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
-------------------- Output --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
I'm just an AI, I don't have personal preferences or feelings, but I'm here to help you with any questions or
optimize_model=False:
-------------------- Prompt --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
-------------------- Output --------------------
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
Ah, a great question! AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence
So this issue may be caused by the PyTorch computation? Since optimize_model=True and optimize_model=False generate different outputs, we should double check whether the model calculation is exactly the same or not.
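A sketch of that double check, assuming the standard ipex-llm loading API from the example and a placeholder model path; note that load_in_4bit itself introduces small quantization noise, so only a large logits gap would indicate a real logic difference:

```python
# Hedged sketch of the suggested double check: load the model twice and compare
# the logits of a single forward pass with and without ipex-llm optimization.
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids

logits = {}
for optimized in (True, False):
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=optimized)
    with torch.inference_mode():
        logits[optimized] = model(input_ids).logits

print("max |diff| of last-token logits:",
      (logits[True][:, -1] - logits[False][:, -1]).abs().max().item())
```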
I tested on transformers v4.38.0; when optimize_model=True, it cannot work because transformers 4.38+ adds a cache_position parameter to forward. The error is as below:
Traceback (most recent call last):
  File "/home/jwang/ipex-llm-jennie/python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2/./generate.py", line 65, in <module>
    output = model.generate(input_ids,
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/transformers/generation/utils.py", line 1544, in generate
    return self.greedy_search(
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/transformers/generation/utils.py", line 2404, in greedy_search
    outputs = self(
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1168, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: llama_model_forward_4_36() got an unexpected keyword argument 'cache_position'
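For reference, the mismatch is easy to reproduce in isolation. The sketch below uses a made-up function name purely to illustrate why the TypeError appears when a forward written against the 4.36 signature meets the 4.38 generation loop:

```python
# Illustration only: the function below is made up to mirror the situation and is
# not ipex-llm's actual code. A forward patched against the transformers 4.36
# signature has no cache_position parameter, so the keyword added by 4.38+ fails.
def patched_model_forward(input_ids=None, attention_mask=None, past_key_values=None):
    """4.36-style signature: cache_position is not accepted."""
    return None

try:
    # transformers 4.38's generation loop now also passes cache_position:
    patched_model_forward(input_ids=None, attention_mask=None, cache_position=None)
except TypeError as e:
    print(e)  # ... got an unexpected keyword argument 'cache_position'
```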
I tested v4.38.0 without ipex-llm optimizations. Compared with v4.36.0, which gets the wrong output when output_attentions=True, the result from v4.38.0 is correct.
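For context, this is roughly how the output_attentions path can be exercised during generation with the standard transformers API (model and tokenizer are assumed to be loaded as in the examples above):

```python
# Sketch of exercising the output_attentions branch during generation.
# Assumes `model` and `tokenizer` are already loaded as in the examples above.
inputs = tokenizer("What is AI?", return_tensors="pt")
out = model.generate(inputs.input_ids,
                     max_new_tokens=32,
                     output_attentions=True,        # selects the branch that can return attn_weights
                     return_dict_in_generate=True)  # keeps per-step attentions in the output
print(tokenizer.decode(out.sequences[0], skip_special_tokens=True))
```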
I compared the original llama implementation in transformers 4.36 with our llama_attention_forward_4_36_original code. The original llama generation uses LlamaSdpaAttention by default, and its logic is as below:
In LlamaModel forward:
  if not output_attentions: calculate attention_mask with _prepare_4d_causal_attention_mask_for_sdpa; the calculated attention_mask is None.
  else: calculate attention_mask with _prepare_4d_causal_attention_mask; the calculated attention_mask is not None.
In LlamaSdpaAttention forward:
  if not output_attentions: use scaled_dot_product_attention to calculate attn_output; the input attention_mask is None.
  else: use native_sdp to calculate attn_weights and attn_output; the input attention_mask is not None, and the calculated result is correct.
In our optimized attention forward logic:
  if not output_attentions: we still use native_sdp to calculate attn_weights and attn_output, but the input attention_mask is None, so the attn_output is not correct.
So we need to change our logic to match the original LlamaSdpaAttention logic.
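A small numeric check of the point above (toy shapes and random tensors, not taken from the issue): without a causal mask, plain softmax attention attends to future tokens, while scaled_dot_product_attention with is_causal=True reproduces the correctly masked result.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bsz, heads, seq, dim = 1, 2, 5, 8
q = torch.randn(bsz, heads, seq, dim)
k = torch.randn(bsz, heads, seq, dim)
v = torch.randn(bsz, heads, seq, dim)

def softmax_attention(q, k, v, mask):
    # Plain softmax attention with an optional additive mask.
    w = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(dim)
    if mask is not None:
        w = w + mask
    return torch.matmul(F.softmax(w, dim=-1), v)

causal = torch.full((seq, seq), float("-inf")).triu(1)  # additive causal mask
masked = softmax_attention(q, k, v, causal)
unmasked = softmax_attention(q, k, v, None)              # the attention_mask=None path
sdpa = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(sdpa, masked, atol=1e-5))    # expected True: SDPA masks internally
print(torch.allclose(sdpa, unmasked, atol=1e-5))  # expected False: mask-less result differs
```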
@gc-fu your fix https://github.com/intel-analytics/ipex-llm/pull/10742 follows the same logic as the original LlamaSdpaAttention. It is OK that the attention_mask is always None, as this matches the original implementation.
When users want to output attn_weights and set output_attentions=True, the attention_mask is not None, and the native_sdp calculation is correct.
Closed as completed~ Thanks @jenniew