ipex-llm
Inference with Llama-2-7b-chat-hf failed with 8k input and INT4 precision
Loading checkpoint shards: 100%|████████████████████████████████████████| 2/2 [00:00<00:00, 17.67it/s]
2024-03-23 15:30:39,474 - INFO - Converting the current model to sym_int4 format......
loading of model costs 5.125399579002988s and 3.875GB
<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>
/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py:1295: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py:218: UserWarning: Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`
  warnings.warn(
Exception in thread Thread-4 (run_model_in_thread):
Traceback (most recent call last):
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/run.py", line 52, in run_model_in_thread
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 1563, in generate
    return self.greedy_search(
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 2385, in greedy_search
    outputs = self(
  File "/home/intel/LLM/BigDL/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 533, in __call__
    return self.model(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 1883, in llama_model_forward
    layer_outputs = decoder_layer(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 228, in llama_decoder_forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 301, in llama_attention_forward_4_31
    return forward_function(
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 395, in llama_attention_forward_4_31_quantized
    attn_output, attn_weights = native_sdp(query_states, repeated_key_states,
  File "/home/intel/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/llama.py", line 1307, in native_sdp
    attn_weights = torch.matmul(query.to(key.dtype),
RuntimeError: Current platform can NOT allocate memory block with size larger than 4GB! Tried to allocate 4.00 GiB (GPU 0; 15.59 GiB total capacity; 4.01 GiB already allocated; 4.51 GiB reserved in total by PyTorch)
Also, the 2nd-token latency with a 1k input is slower than with 2k and 4k inputs:
Version 2.5.0b20240319 is normal:
meta-llama/Llama-2-7b-chat-hf,526.2,16.7,0.0,1024-512,1,1025-512,1,sym_int4,N/A,5.16,5.35546875,N/A
We have reproduced this issue. With 8k input, query and key have shape [1, 32, 8197, 128]. After the matmul, the resulting attention-weights tensor has shape [1, 32, 8197, 8197]; in fp16 that is ~4101 MiB, which triggers the IPEX memory allocation error because IPEX cannot allocate a single memory block larger than 4GB.
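For reference, a rough back-of-the-envelope sketch (not from the original report) of that attention-weights tensor's size:

```python
# Illustrative size estimate for the attention-weights tensor produced by
# query @ key^T with an ~8k-token prompt (shapes taken from the comment above).
batch, heads, seq_len = 1, 32, 8197
bytes_per_elem = 2  # fp16

attn_bytes = batch * heads * seq_len * seq_len * bytes_per_elem
print(f"{attn_bytes / 1024**2:.0f} MiB")  # ~4101 MiB, above the 4GB single-allocation limit
```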
Testing only the matmul operation with the same shapes fails as well:
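A minimal sketch of such a standalone test, assuming an Intel GPU (`xpu`) device with `intel_extension_for_pytorch` installed; this is not the exact script used, but it exercises the same shapes:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the 'xpu' device

# Same shapes as in the failing attention: [1, 32, 8197, 128] for query and key.
query = torch.randn(1, 32, 8197, 128, dtype=torch.float16, device="xpu")
key = torch.randn(1, 32, 8197, 128, dtype=torch.float16, device="xpu")

# The result has shape [1, 32, 8197, 8197] (~4101 MiB in fp16), so this single
# allocation exceeds the 4GB block limit and raises the same RuntimeError.
attn_weights = torch.matmul(query, key.transpose(-2, -1))
```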
We are checking whether there is a solution.
8k input is now supported; here is the example: https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/Long-Context/LLaMA2-32K. Closing this issue.
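For completeness, a minimal sketch following the pattern of the linked long-context example (model path, prompt, and generation settings are placeholders, not copied from the example):

```python
import torch
from transformers import LlamaTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # or a long-context variant, as in the linked example

# Load the model with INT4 (sym_int4) optimizations and move it to the Intel GPU.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             optimize_model=True,
                                             trust_remote_code=True,
                                             use_cache=True).to("xpu")
tokenizer = LlamaTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "..."  # an ~8k-token prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, do_sample=False, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```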