ipex-llm
Inference with Llama-2-7b-chat-hf failed with 8k input and INT4 precision
Loading checkpoint shards: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 17.67it/s]
2024-03-23 15:30:39,474 - INFO - Converting the current model to sym_int4 format......
loading of model costs 5.125399579002988s and 3.875GB
<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>
/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py:1295: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py:218: UserWarning: Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`
  warnings.warn(
Exception in thread Thread-4 (run_model_in_thread):
Traceback (most recent call last):
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/run.py", line 52, in run_model_in_thread
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 1563, in generate
    return self.greedy_search(
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 2385, in greedy_search
    outputs = self(
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 533, in __call__
    return self.model(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 1883, in llama_model_forward
    layer_outputs = decoder_layer(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 228, in llama_decoder_forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 301, in llama_attention_forward_4_31
    return forward_function(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 395, in llama_attention_forward_4_31_quantized
    attn_output, attn_weights = native_sdp(query_states, repeated_key_states,
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 1307, in native_sdp
    attn_weights = torch.matmul(query.to(key.dtype),
RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 4.00 GiB (GPU 0; 15.59 GiB total capacity; 4.01 GiB already allocated; 4.51 GiB reserved in total by PyTorch)
Also, the 2nd-token latency with a 1k input is slower than with 2k and 4k inputs:
Version 2.5.0b20240319 is normal:
meta-llama/Llama-2-7b-chat-hf,526.2,16.7,0.0,1024-512,1,1025-512,1,sym_int4,N/A,5.16,5.35546875,N/A
We have reproduced this issue. With 8k input, query and key have shape [1, 32, 8197, 128]. After the matmul, the resulting attention-weights tensor has shape [1, 32, 8197, 8197]; in fp16 that is ~4101 MiB, which triggers the IPEX memory allocation error because IPEX cannot allocate a single memory block larger than 4GB.
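For reference, a rough back-of-the-envelope sketch (not from the original report) of that attention-weights tensor's size:

```python
# Illustrative size estimate for the attention-weights tensor produced by
# query @ key^T with an ~8k-token prompt (shapes taken from the comment above).
batch, heads, seq_len = 1, 32, 8197
bytes_per_elem = 2  # fp16

attn_bytes = batch * heads * seq_len * seq_len * bytes_per_elem
print(f"{attn_bytes / 1024**2:.0f} MiB")  # ~4101 MiB, above the 4GB single-allocation limit
```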
Testing only the matmul operation with the same shapes fails as well:
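A minimal sketch of such a standalone test, assuming an Intel GPU (`xpu`) device with `intel_extension_for_pytorch` installed; this is not the exact script used, but it exercises the same shapes:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the 'xpu' device

# Same shapes as in the failing attention: [1, 32, 8197, 128] for query and key.
query = torch.randn(1, 32, 8197, 128, dtype=torch.float16, device="xpu")
key = torch.randn(1, 32, 8197, 128, dtype=torch.float16, device="xpu")

# The result has shape [1, 32, 8197, 8197] (~4101 MiB in fp16), so this single
# allocation exceeds the 4GB block limit and raises the same RuntimeError.
attn_weights = torch.matmul(query, key.transpose(-2, -1))
```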
We are checking whether there is a solution.
8k input is now supported; here is the example: https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Long-Context/LLaMA2-32K. Closing this issue.
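For completeness, a minimal sketch following the pattern of the linked long-context example (model path, prompt, and generation settings are placeholders, not copied from the example):

```python
import torch
from transformers import LlamaTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # or a long-context variant, as in the linked example

# Load the model with INT4 (sym_int4) optimizations and move it to the Intel GPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             use_cache=True).to("xpu")
tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "..."  # an ~8k-token prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, do_sample=False, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```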