Qwen2.5 series errors when pruning
(py310) root@autodl-container-bbd74aba75-f6db4254:~/autodl-tmp/pruning/torch_pruning/Torch-Pruning/examples/LLMs# python prune_llm.py --model /root/autodl-tmp/model/q7obo --pruning_ratio 0.428571428 --max_seq_len 4096 --save_model /root/autodl-tmp/model/q7bp
torch 2.6.0
transformers 4.49.0
accelerate 1.4.0
# of gpus: 2
loading llm model /root/autodl-tmp/model/q7obo
Sliding Window Attention is enabled but not implemented for sdpa; unexpected results may be encountered.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.17s/it]
We've detected an older driver with an RTX 4000 series GPU. These drivers have issues with P2P. This can affect the multi-gpu inference when using accelerate device_map.Please make sure to update your driver to the latest version which resolves this.
use device cuda:0
----------------- Before Pruning -----------------
Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
/root/miniconda3/envs/py310/lib/python3.10/site-packages/torch_pruning/dependency.py:699: UserWarning: Unwrapped parameters detected: ['model.layers.1.input_layernorm.weight', 'model.layers.4.post_attention_layernorm.weight', 'model.layers.9.input_layernorm.weight', 'model.layers.11.input_layernorm.weight', 'model.layers.13.input_layernorm.weight', 'model.layers.14.post_attention_layernorm.weight', 'model.layers.15.post_attention_layernorm.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.6.post_attention_layernorm.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.17.input_layernorm.weight', 'model.layers.18.post_attention_layernorm.weight', 'model.layers.20.input_layernorm.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.1.post_attention_layernorm.weight', 'model.layers.5.post_attention_layernorm.weight', 'model.layers.10.input_layernorm.weight', 'model.layers.13.post_attention_layernorm.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.18.input_layernorm.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.21.post_attention_layernorm.weight', 'model.layers.27.input_layernorm.weight', 'model.norm.weight', 'model.layers.3.input_layernorm.weight', 'model.layers.7.post_attention_layernorm.weight', 'model.layers.15.input_layernorm.weight', 'model.layers.23.input_layernorm.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.2.post_attention_layernorm.weight', 'model.layers.5.input_layernorm.weight', 'model.layers.12.input_layernorm.weight', 'model.layers.12.post_attention_layernorm.weight', 'model.layers.14.input_layernorm.weight', 'model.layers.19.post_attention_layernorm.weight', 'model.layers.22.post_attention_layernorm.weight', 'model.layers.7.input_layernorm.weight', 'model.layers.8.input_layernorm.weight', 'model.layers.20.post_attention_layernorm.weight', 'model.layers.23.post_attention_layernorm.weight', 'model.layers.25.input_layernorm.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.2.input_layernorm.weight', 'model.layers.3.post_attention_layernorm.weight', 'model.layers.4.input_layernorm.weight', 'model.layers.11.post_attention_layernorm.weight', 'model.layers.21.input_layernorm.weight', 'model.layers.22.input_layernorm.weight', 'model.layers.27.post_attention_layernorm.weight', 'model.layers.0.post_attention_layernorm.weight', 'model.layers.6.input_layernorm.weight', 'model.layers.8.post_attention_layernorm.weight', 'model.layers.9.post_attention_layernorm.weight', 'model.layers.10.post_attention_layernorm.weight', 'model.layers.17.post_attention_layernorm.weight', 'model.layers.24.input_layernorm.weight', 'model.layers.25.post_attention_layernorm.weight'].
Torch-Pruning will prune the last non-singleton dimension of these parameters. If you wish to change this behavior, please provide an unwrapped_parameters argument.
warnings.warn(warning_str)
Traceback (most recent call last):
File "/root/autodl-tmp/pruning/torch_pruning/Torch-Pruning/examples/LLMs/prune_llm.py", line 390, in
I have tried multiple times with Qwen2.5-7B, Qwen2.5-3B-Instruct, and Qwen2.5-7B-Instruct; all have the same issue.
I hit this issue with Qwen2-7B as well.
I hit this issue with Qwen2.5-7B as well.
I have the same issue with Llama and Phi models, even though I followed the instructions from here. Am I the only one encountering this?
As suggested here, add model.config.use_cache = False.
After adding model.config.use_cache = False, how long does it take to prune the model? Since the cache is used to speed things up, will pruning take many times longer?
Pruning is not the "heavy" part of the process; it doesn't actually take long. You should give it a try.
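For context, a minimal sketch of where that flag fits in the flow of prune_llm.py (an assumption about the script layout): the KV cache only matters for autoregressive generation, while pruning just needs plain forward passes, so disabling it avoids tracing the cache tensors rather than slowing anything down.

```python
# Hedged sketch: disable the KV cache only for the pruning forward passes,
# then restore it before running generation/evaluation on the pruned model.
model.config.use_cache = False   # pruning only needs single forward passes
# ... build the pruner and call pruner.step() here, as in prune_llm.py ...
model.config.use_cache = True    # re-enable the cache for generation afterwards
```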
After fixing the 'grad_fn' issue, I got another error:

  File "/mnt/xcli/pruning/codes/Torch-Pruning/examples/LLMs/prune_llm.py", line 363, in main
    m.num_key_value_groups = m.num_heads // m.num_key_value_heads
  File "/mnt/xcli/envs/llama_cpp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2Attention' object has no attribute 'num_key_value_heads'

If I comment out m.num_key_value_groups = m.num_heads // m.num_key_value_heads, I get this error instead:

  File "/mnt/xcli/envs/llama_cpp/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 192, in forward
    attn_output, attn_weights = attention_interface(
  File "/mnt/xcli/envs/llama_cpp/lib/python3.9/site-packages/transformers/integrations/sdpa_attention.py", line 53, in sdpa_attention_forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The size of tensor a (16) must match the size of tensor b (28) at non-singleton dimension 1

QAQ How do I fix this?
Following ChatGPT's suggestion, I set prune_num_heads=False and changed the module-update loop to:
for name, m in model.named_modules():
    if name.endswith("self_attn"):
        if seperate_qkv:
            m.hidden_size = m.q_proj.out_features
        else:
            m.hidden_size = m.qkv_proj.out_features // 3
        # Use the head count from the original config directly instead of recomputing it
        m.num_heads = model.config.num_attention_heads
        # Since heads are not pruned, keep the original num_key_value_heads and derive the groups from it
        m.num_key_value_groups = model.config.num_attention_heads // model.config.num_key_value_heads
    elif name.endswith("mlp"):
        if hasattr(m, "gate_proj"):
            m.hidden_size = m.gate_proj.in_features
            model.config.intermediate_size = m.gate_proj.out_features
        elif hasattr(m, "gate_up_proj"):
            m.hidden_size = m.gate_up_proj.in_features
            model.config.intermediate_size = m.gate_up_proj.out_features // 2
        else:
            raise ValueError("Unknown mlp layer")
It ran, but the model is not usable after pruning.
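One hedged variant of that loop, assuming transformers >= 4.48 where Qwen2Attention reads its head counts from the config and only keeps head_dim and num_key_value_groups on the module: derive both counts from the pruned projection widths instead of freezing them at the original config values, so the groups used by repeat_kv/SDPA match the actual tensor shapes (the 16-vs-28 mismatch above is exactly 16 pruned query heads against 4 key/value heads repeated by the stale group count of 7).

```python
# Hedged sketch: recompute the attention bookkeeping from the pruned projection
# widths. Assumes transformers >= 4.48, where Qwen2Attention exposes head_dim and
# num_key_value_groups on the module and reads the head counts from model.config.
for name, m in model.named_modules():
    if name.endswith("self_attn"):
        num_heads = m.q_proj.out_features // m.head_dim      # e.g. 28 -> 16 after head pruning
        num_kv_heads = m.k_proj.out_features // m.head_dim   # e.g. 4, possibly pruned as well
        m.num_key_value_groups = num_heads // num_kv_heads   # used by repeat_kv / SDPA
        # keep the config consistent so the pruned checkpoint reloads with the right shapes
        model.config.num_attention_heads = num_heads
        model.config.num_key_value_heads = num_kv_heads
    elif name.endswith("mlp") and hasattr(m, "gate_proj"):
        model.config.intermediate_size = m.gate_proj.out_features
```

This only helps if the pruner keeps the query-head count a multiple of the key/value-head count; otherwise grouped-query attention cannot be reconstructed.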
Is the model unusable after pruning because the inference code needs to be modified accordingly?
Could anyone share the inference code for the pruned model?
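For what it's worth, here is a minimal inference sketch for a pruned checkpoint, assuming it was written with save_pretrained (e.g. the --save_model path from the first post) and that the saved config already reflects the pruned shapes; apart from that assumption it is just the standard transformers generation flow.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

pruned_path = "/root/autodl-tmp/model/q7bp"   # the --save_model directory; adjust to yours

tokenizer = AutoTokenizer.from_pretrained(pruned_path)
model = AutoModelForCausalLM.from_pretrained(
    pruned_path, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

prompt = "Give me a short introduction to large language model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```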
The code in the README runs fine for me (Qwen/Qwen2.5-0.5B-Instruct), but the output is not readable.
Original response is: Give me a short introduction to large language model. Certainly! A large language model (LLM) is a type of artificial intelligence that can generate human-like text based on the input it receives. These models typically use deep learning techniques and neural networks to learn from vast amounts of data, allowing them to understand complex patterns, context, and nuances in natural language. They have been used in various applications such as chatbots, virtual assistants, legal research, and more.
After pruning: Give me a short introduction to large language model. 10/i2etices262111124222111121111732115111111111111171111111115021150100011 / 0111 � 100 /11 0111 / t g0 g11 g01111 and1111101101121111111110011111101111111111111101111100111111111111011111110111111101011211110111011011141101001111 g11 g111 g g i g g1 g g g g11 n111101111111111111111111011111111110111011111111110111110111101111011111111111111011111111111014141211101111111104111111110104111111101011111111111111111111111111111011111111111110110011111011111111111014111211111111111111110111111111111111111110111111111111111111111111111
Or like: Give me a short introduction to large language model.
-
-
-
-
- ( - — - - - - - - as - .. - - - - — - - - - - - - # - - - - # [ -4 - — - - - right ( -- of - - - - - — — -- — - - - - - - - - - - - - - - - - - - - — - - - - - -- - - - - - - - - - — - - - - - - -4 -84 - - -4 - - —6 -2 - - - - - -- -2 - - - - - - - - - -14 - -4 - - - - , -4 - -2 - - - 2 - -24 - - 42 6 - - - - 42 2 -4 44 - - 744 - 2 4 - - - - -4 - - 4 - - - 4 4 4 4 -4 -4 4 4 4 - ( 4443 4 44 4 64 44 4 - 49 - 4 43 44 46 -7 - 44444342 - 6 4 4474 4444 7 2 2472 - - 44 47 - - 4 44 4 47 - var4 - 6 4 2 2 -424 2466 4
-
-
-
So I tried modifying prune_llm.py to ignore the embed_tokens layer and set prune_num_heads to False; then the response is empty...
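For reference, a rough sketch of that modification, assuming the tp.pruner.MetaPruner arguments used in examples/LLMs/prune_llm.py (the importance criterion, sequence length, and ratio below are placeholders):

```python
import torch
import torch_pruning as tp

# Hedged sketch: keep the vocabulary-sized layers and the attention-head layout
# intact, so only hidden/intermediate channels are pruned.
example_inputs = torch.randint(0, model.config.vocab_size, (1, 64)).to(model.device)

pruner = tp.pruner.MetaPruner(
    model,
    example_inputs,
    importance=tp.importance.MagnitudeImportance(p=2),          # placeholder criterion
    pruning_ratio=0.3,                                          # placeholder ratio
    ignored_layers=[model.model.embed_tokens, model.lm_head],   # skip embed_tokens / lm_head
    prune_num_heads=False,                                      # keep the original head counts
)
pruner.step()
```

Even when the shapes check out, an empty or garbled response after pruning at this ratio usually means the remaining weights still need some recovery fine-tuning before the output becomes readable again.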