ipex-llm

Inference with Llama-2-7b-chat-hf fails with 8k input and INT4 precision

Fred-cell opened this issue · 3 comments

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 17.67it/s]
2024-03-23 15:30:39,474 - INFO - Converting the current model to sym_int4 format......

loading of model costs 5.125399579002988s and 3.875GB
<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>
/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py:1295: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py:218: UserWarning: Passing padding_mask is deprecated and will be removed in v4.37. Please make sure use attention_mask instead.
  warnings.warn(
Exception in thread Thread-4 (run_model_in_thread):
Traceback (most recent call last):
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/run.py", line 52, in run_model_in_thread
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 1563, in generate
    return self.greedy_search(
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 2385, in greedy_search
    outputs = self(
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 533, in __call__
    return self.model(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 1883, in llama_model_forward
    layer_outputs = decoder_layer(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 228, in llama_decoder_forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 301, in llama_attention_forward_4_31
    return forward_function(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 395, in llama_attention_forward_4_31_quantized
    attn_output, attn_weights = native_sdp(query_states, repeated_key_states,
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 1307, in native_sdp
    attn_weights = torch.matmul(query.to(key.dtype),
RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 4.00 GiB (GPU 0; 15.59 GiB total capacity; 4.01 GiB already allocated; 4.51 GiB reserved in total by PyTorch)

Fred-cell · Mar 23 '24

Also, the 2nd token latency with 1k input is slower than with 2k and 4k inputs: [image]

Fred-cell · Mar 23 '24

Version 2.5.0b20240319 behaves normally:
meta-llma/Llama-2-7b-chat-hf,526.2,16.7,0.0,1024-512,1,1025-512,1,sym_int4,N/A,5.16,5.35546875,N/A

Fred-cell · Mar 23 '24

We have reproduced this issue. With 8k input, query and key have shape [1, 32, 8197, 128]. After the matmul, the result tensor has shape [1, 32, 8197, 8197]; in fp16 its size is ~4101 MB, which triggers the IPEX limitation that a single memory block larger than 4 GB cannot be allocated. [image]
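For reference, the size of that attention-score tensor can be checked with a quick back-of-the-envelope calculation (not part of the original report, just arithmetic on the shapes quoted above):

```python
# fp16 attention-score tensor of shape [1, 32, 8197, 8197]
batch, heads, seq_q, seq_k = 1, 32, 8197, 8197
bytes_per_elem = 2  # fp16
size_bytes = batch * heads * seq_q * seq_k * bytes_per_elem
print(f"{size_bytes / 1024**2:.0f} MiB")  # ~4101 MiB, above the 4 GB single-allocation limit
```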

Testing only the matmul operation with the same shapes fails as well: [image]
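A minimal sketch of that standalone matmul test is below, assuming an XPU build of PyTorch with intel_extension_for_pytorch installed (the exact test script is not shown in the thread, so this is an illustrative reconstruction from the shapes in the traceback):

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the 'xpu' device

# Same query/key shapes as reported for the 8k-input case
query = torch.randn(1, 32, 8197, 128, dtype=torch.float16, device="xpu")
key = torch.randn(1, 32, 8197, 128, dtype=torch.float16, device="xpu")

# Result is [1, 32, 8197, 8197] in fp16 (~4101 MiB), exceeding the 4 GB
# single-allocation limit and raising the same RuntimeError.
attn_weights = torch.matmul(query, key.transpose(2, 3))
```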

We are checking whether there is a solution for this.

hkvision · Mar 29 '24

We now support 8k input; here is the example: https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Long-Context/LLaMA2-32K. Closing this issue.
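For anyone landing here later, long-context loading with ipex-llm on an Intel GPU roughly follows the pattern below. This is a minimal sketch, not the maintained example: the model path, prompt file, and generation settings are assumptions, so please refer to the linked Long-Context/LLaMA2-32K example for the authoritative code.

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "togethercomputer/LLaMA-2-7B-32K"  # assumed long-context checkpoint

# Load weights in low-bit (sym_int4) form and move the model to the Intel GPU
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    optimize_model=True,
    trust_remote_code=True,
    use_cache=True,
).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

long_prompt = open("prompt_8k.txt").read()  # hypothetical ~8k-token prompt

with torch.inference_mode():
    input_ids = tokenizer(long_prompt, return_tensors="pt").input_ids.to("xpu")
    output = model.generate(input_ids, do_sample=False, max_new_tokens=512)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```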

hkvision · Apr 10 '24