
Llama-2-7b-chat-hf produces wrong output on CPU

Open · gc-fu opened this issue 1 year ago

Details: https://github.com/analytics-zoo/nano/issues/1246#issuecomment-2046881777

This problem happens with transformers versions greater than 4.36.0.

The problem can be solved either by setting optimize_model=False or by using transformers==4.34.0.

I guess the problem might be here: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/src/ipex_llm/transformers/convert.py#L846

gc-fu avatar Apr 10 '24 08:04 gc-fu

CC @glorysdj @Oscilloscope98

gc-fu avatar Apr 10 '24 08:04 gc-fu

After some investigation, the cause of the problem turns out to be somewhat weird.

It seems that our implementation of native_sdp generates the wrong output.

However, this native_sdp is roughly the same as the corresponding attention computation in transformers, which indicates that using transformers' LlamaAttention forward function will also generate the wrong output. After some verification, this conclusion holds.
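
For reference, here is a minimal sketch (not the actual ipex-llm or transformers code) of the matmul/softmax attention math that both paths roughly perform; the function name, argument names, and shapes are illustrative assumptions:

```python
import math
import torch

def native_sdp_sketch(query, key, value, attention_mask, head_dim):
    # query/key/value: (bsz, num_heads, seq_len, head_dim)
    attn_weights = torch.matmul(query, key.transpose(2, 3)) / math.sqrt(head_dim)
    if attention_mask is not None:
        # the additive causal mask is applied here; if it is None,
        # every token can also attend to later tokens
        attn_weights = attn_weights + attention_mask
    attn_weights = torch.nn.functional.softmax(
        attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_output = torch.matmul(attn_weights, value)
    return attn_output, attn_weights
```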

-------------------- Prompt --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]

-------------------- Output --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]

I'm just an AI, I don't have personal preferences or feelings, but I'm here to help you with any questions or

For now, I have found that we can get the correct output by using torch.nn.functional.scaled_dot_product_attention to calculate the attention.

-------------------- Prompt --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]

-------------------- Output --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
Ah, a great question! AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence

However, this function cannot return attn_weights. Most of the time, users will not need attn_weights, so we can temporarily fix this problem by using torch.nn.functional.scaled_dot_product_attention whenever the user does not request attn_weights, until we find a better way to fix it.
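
A rough sketch of the proposed dispatch, assuming a simplified attention forward (the function name and signature are illustrative, not the actual ipex-llm code):

```python
import torch.nn.functional as F

def attention_forward_sketch(query, key, value, attention_mask, output_attentions):
    if not output_attentions:
        # the fused kernel applies causal masking itself and never returns weights
        attn_output = F.scaled_dot_product_attention(
            query, key, value,
            attn_mask=attention_mask,                                # may be None
            is_causal=attention_mask is None and query.size(2) > 1,
        )
        return attn_output, None
    # callers that do request attn_weights would keep going through the explicit
    # matmul/softmax path (native_sdp), as sketched earlier in this thread
    raise NotImplementedError("native_sdp path omitted in this sketch")
```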

gc-fu avatar Apr 11 '24 07:04 gc-fu

It seems that under the same conditions, v4.38.0 is correct. I guess something is missing in our native_sdp.

Update: the attention_mask is always None during the calculation. I guess this is not the expected behavior.
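
A tiny, self-contained illustration (numbers chosen arbitrarily, not from the model) of why a missing additive causal mask changes the result of the explicit softmax-attention path:

```python
import math
import torch

q = k = v = torch.eye(2).view(1, 1, 2, 2)       # (bsz, heads, seq_len, head_dim)
scores = q @ k.transpose(2, 3) / math.sqrt(2)
causal = torch.triu(torch.full((2, 2), float("-inf")), diagonal=1)

print(torch.softmax(scores, dim=-1) @ v)           # no mask: token 0 also attends to token 1
print(torch.softmax(scores + causal, dim=-1) @ v)  # causal mask: token 0 only sees itself
```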

gc-fu avatar Apr 11 '24 08:04 gc-fu

I could not reproduce this issue in my CPU environment. The result is reasonable and is the same whether I set optimize_model=False or True. Code: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2/generate.py
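
For context, the linked example boils down to roughly the following (a sketch from memory of the ipex-llm example, with the model path and prompt as placeholders, not the exact generate.py):

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"   # placeholder path
prompt = "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\nWhat is AI?[/INST]"

# load_in_4bit applies ipex-llm INT4 quantization; flip optimize_model to compare
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

with torch.inference_mode():
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```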

My environment: ipex-llm 2.1.0b20240409, transformers 4.36.2, torch 2.1.0+cpu, torchaudio 2.1.0+cpu, torchvision 0.16.0+cpu

My model version is from Jul 19, 2023.

The result of optimize_model=True:

-------------------- Prompt --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]

-------------------- Output --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
Ah, a great question! AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence

The result of optimize_model=False:

-------------------- Prompt --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]

-------------------- Output --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
Ah, a great question! AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence

jenniew avatar Apr 11 '24 20:04 jenniew

After I updated torch to 2.1.2, I reproduced this issue. Environment: torch 2.1.2+cpu, torchaudio 2.1.2+cpu, torchvision 0.16.2+cpu

But on my side, optimize_model=False produces the correct answer, while optimize_model=True produces the wrong answer.

optimize_model=True:

-------------------- Prompt --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]

-------------------- Output --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]

I'm just an AI, I don't have personal preferences or feelings, but I'm here to help you with any questions or

optimize_model=False:

-------------------- Prompt --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]

-------------------- Output --------------------

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>
What is AI?[/INST]
Ah, a great question! AI, or Artificial Intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence

So this issue may be caused by the PyTorch computation? Since optimize_model=True and optimize_model=False generate different outputs, we may want to double-check whether the model calculation is exactly the same or not.
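
One way (just a suggestion sketch, not code from the thread) to check whether the two configurations really compute the same thing is to compare the first-step logits on an identical prompt; the model path and prompt are placeholders:

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"    # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer("[INST] What is AI?[/INST]", return_tensors="pt").input_ids

last_logits = []
for opt in (True, False):
    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True,
                                                 optimize_model=opt)
    with torch.inference_mode():
        last_logits.append(model(input_ids).logits[:, -1, :].float())

# a large difference here would mean the two paths really calculate differently
print(torch.max(torch.abs(last_logits[0] - last_logits[1])))
```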

jenniew avatar Apr 11 '24 21:04 jenniew

I tested on transformers v4.38.0; when optimize_model=True, it cannot work because transformers 4.38+ adds a cache_position parameter to forward. The error is as below:

Traceback (most recent call last):
  File "/home/jwang/ipex-llm-jennie/python/llm/example/CPU/HF-Transformers-AutoModels/Model/llama2/./generate.py", line 65, in <module>
    output = model.generate(input_ids,
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/transformers/generation/utils.py", line 1544, in generate
    return self.greedy_search(
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/transformers/generation/utils.py", line 2404, in greedy_search
    outputs = self(
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1168, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/jiao-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: llama_model_forward_4_36() got an unexpected keyword argument 'cache_position'
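
This is not the actual ipex-llm fix, just a sketch of the usual way a patched forward keeps working when a newer transformers release introduces a keyword argument such as cache_position; the function name and signature are hypothetical:

```python
# hypothetical patched forward; only the signature matters for this TypeError
def llama_model_forward_sketch(self, input_ids=None, attention_mask=None,
                               position_ids=None, past_key_values=None,
                               cache_position=None, **kwargs):
    # transformers 4.38+ passes cache_position; a patched forward that does not
    # declare it (or absorb it via **kwargs) raises the TypeError shown above
    ...
```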

jenniew avatar Apr 12 '24 19:04 jenniew

> I tested on transformers v4.38.0; when optimize_model=True, it cannot work because transformers 4.38+ adds a cache_position parameter to forward. [...]

I tested v4.38.0 without ipex-llm optimizations. Compared with v4.36.0, which gets the wrong output when output_attentions=True, the result v4.38.0 gets is correct.

gc-fu avatar Apr 15 '24 00:04 gc-fu

I compared the original llama implementation in transformers 4.36 with our llama_attention_forward_4_36_original code. The original llama generation uses LlamaSdpaAttention by default; the logic is as below:

In LlamaModel forward:

if not output_attentions:
    calculate attention_mask with _prepare_4d_causal_attention_mask_for_sdpa. The calculated attention_mask is None.
else:
    calculate attention_mask with _prepare_4d_causal_attention_mask. The calculated attention_mask is not None.

In LlamaSdpaAttention forward:

if not output_attentions:
    use scaled_dot_product_attention to calculate attn_output. The input attention_mask is None.
else:
    use native_sdp to calculate attn_weights and attn_output. The input attention_mask is not None, and the calculated result is correct.

In our optimized attention forward logic:

if not output_attentions:
    we still use native_sdp to calculate attn_weights and attn_output. But the input attention_mask is None, so the attn_output is not correct.

So we need to change our logic to match the original LlamaSdpaAttention logic, as sketched below.
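
A condensed sketch of that transformers 4.36 mask-preparation logic (paraphrased from the description above, not copied from modeling_llama.py; the wrapper function is illustrative):

```python
from transformers.modeling_attn_mask_utils import (
    _prepare_4d_causal_attention_mask,
    _prepare_4d_causal_attention_mask_for_sdpa,
)

def prepare_mask_sketch(attention_mask, input_shape, inputs_embeds,
                        past_key_values_length, output_attentions):
    if not output_attentions:
        # SDPA path: this helper returns None whenever the fused kernel can
        # apply causal masking on its own
        return _prepare_4d_causal_attention_mask_for_sdpa(
            attention_mask, input_shape, inputs_embeds, past_key_values_length)
    # eager / native_sdp path: an explicit additive 4D causal mask is built
    return _prepare_4d_causal_attention_mask(
        attention_mask, input_shape, inputs_embeds, past_key_values_length)
```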

@gc-fu Your fix https://github.com/intel-analytics/ipex-llm/pull/10742 follows the same logic as the original LlamaSdpaAttention. The attention_mask always being None is OK, since that matches the original implementation. When users want to output attn_weights and set output_attentions=True, the attention_mask is not None and the native_sdp calculation is correct.

jenniew avatar Apr 17 '24 22:04 jenniew

Closed as completed~ Thanks @jenniew

gc-fu avatar Apr 18 '24 03:04 gc-fu