Qwen2.5 series errors when pruning
(py310) root@autodl-container-bbd74aba75-f6db4254:~/autodl-tmp/pruning/torch_pruning/Torch-Pruning/examples/LLMs# python prune_llm.py --model /root/autodl-tmp/model/q7obo --pruning_ratio 0.428571428 --max_seq_len 4096 --save_model /root/autodl-tmp/model/q7bp
torch 2.6.0
transformers 4.49.0
accelerate 1.4.0
# of gpus: 2
loading llm model /root/autodl-tmp/model/q7obo
Sliding Window Attention is enabled but not implemented for sdpa; unexpected results may be encountered.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00, 1.17s/it]
We've detected an older driver with an RTX 4000 series GPU. These drivers have issues with P2P. This can affect the multi-gpu inference when using accelerate device_map.Please make sure to update your driver to the latest version which resolves this.
use device cuda:0
----------------- Before Pruning -----------------
Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), eps=1e-06)
    (rotary_emb): Qwen2RotaryEmbedding()
  )
  (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)
/root/miniconda3/envs/py310/lib/python3.10/site-packages/torch_pruning/dependency.py:699: UserWarning: Unwrapped parameters detected: ['model.layers.1.input_layernorm.weight', 'model.layers.4.post_attention_layernorm.weight', 'model.layers.9.input_layernorm.weight', 'model.layers.11.input_layernorm.weight', 'model.layers.13.input_layernorm.weight', 'model.layers.14.post_attention_layernorm.weight', 'model.layers.15.post_attention_layernorm.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.6.post_attention_layernorm.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.17.input_layernorm.weight', 'model.layers.18.post_attention_layernorm.weight', 'model.layers.20.input_layernorm.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.1.post_attention_layernorm.weight', 'model.layers.5.post_attention_layernorm.weight', 'model.layers.10.input_layernorm.weight', 'model.layers.13.post_attention_layernorm.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.18.input_layernorm.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.21.post_attention_layernorm.weight', 'model.layers.27.input_layernorm.weight', 'model.norm.weight', 'model.layers.3.input_layernorm.weight', 'model.layers.7.post_attention_layernorm.weight', 'model.layers.15.input_layernorm.weight', 'model.layers.23.input_layernorm.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.2.post_attention_layernorm.weight', 'model.layers.5.input_layernorm.weight', 'model.layers.12.input_layernorm.weight', 'model.layers.12.post_attention_layernorm.weight', 'model.layers.14.input_layernorm.weight', 'model.layers.19.post_attention_layernorm.weight', 'model.layers.22.post_attention_layernorm.weight', 'model.layers.7.input_layernorm.weight', 'model.layers.8.input_layernorm.weight', 'model.layers.20.post_attention_layernorm.weight', 'model.layers.23.post_attention_layernorm.weight', 'model.layers.25.input_layernorm.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.2.input_layernorm.weight', 'model.layers.3.post_attention_layernorm.weight', 'model.layers.4.input_layernorm.weight', 'model.layers.11.post_attention_layernorm.weight', 'model.layers.21.input_layernorm.weight', 'model.layers.22.input_layernorm.weight', 'model.layers.27.post_attention_layernorm.weight', 'model.layers.0.post_attention_layernorm.weight', 'model.layers.6.input_layernorm.weight', 'model.layers.8.post_attention_layernorm.weight', 'model.layers.9.post_attention_layernorm.weight', 'model.layers.10.post_attention_layernorm.weight', 'model.layers.17.post_attention_layernorm.weight', 'model.layers.24.input_layernorm.weight', 'model.layers.25.post_attention_layernorm.weight'].
Torch-Pruning will prune the last non-singleton dimension of these parameters. If you wish to change this behavior, please provide an unwrapped_parameters argument.
warnings.warn(warning_str)
Traceback (most recent call last):
File "/root/autodl-tmp/pruning/torch_pruning/Torch-Pruning/examples/LLMs/prune_llm.py", line 390, in
I have tried multiple times with Qwen2.5-7B, Qwen2.5-3B-Instruct, and Qwen2.5-7B-Instruct; all have the same issue.
I hit this issue with Qwen2-7B as well.
I hit this issue with Qwen2.5-7B as well.
I have the same issue with Llama and Phi models, even though I followed the instructions from here. Am I the only one encountering this?
As suggested here, add model.config.use_cache = False.
After adding model.config.use_cache = False, how long does it take to prune the model? Since the cache is used to speed things up, will pruning take many times longer?
Pruning is not the "heavy" part of the process; it doesn't actually take long. You should give it a try.
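For context, a minimal sketch of where that flag fits in the flow of prune_llm.py (an assumption about the script layout): the KV cache only matters for autoregressive generation, while pruning just needs plain forward passes, so disabling it avoids tracing the cache tensors rather than slowing anything down.

```python
# Hedged sketch: disable the KV cache only for the pruning forward passes,
# then restore it before running generation/evaluation on the pruned model.
model.config.use_cache = False   # pruning only needs single forward passes
# ... build the pruner and call pruner.step() here, as in prune_llm.py ...
model.config.use_cache = True    # re-enable the cache for generation afterwards
```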
After fixing the 'grad_fn' issue, I got another error:

  File "/mnt/xcli/pruning/codes/Torch-Pruning/examples/LLMs/prune_llm.py", line 363, in main
    m.num_key_value_groups = m.num_heads // m.num_key_value_heads
  File "/mnt/xcli/envs/llama_cpp/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'Qwen2Attention' object has no attribute 'num_key_value_heads'

If I comment out m.num_key_value_groups = m.num_heads // m.num_key_value_heads, I get this error instead:

  File "/mnt/xcli/envs/llama_cpp/lib/python3.9/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 192, in forward
    attn_output, attn_weights = attention_interface(
  File "/mnt/xcli/envs/llama_cpp/lib/python3.9/site-packages/transformers/integrations/sdpa_attention.py", line 53, in sdpa_attention_forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The size of tensor a (16) must match the size of tensor b (28) at non-singleton dimension 1

QAQ How do I fix this?
Following ChatGPT's suggestion, I set prune_num_heads=False and changed the module-update loop to:
for name, m in model.named_modules():
    if name.endswith("self_attn"):
        if seperate_qkv:
            m.hidden_size = m.q_proj.out_features
        else:
            m.hidden_size = m.qkv_proj.out_features // 3
        # Use the head count from the original config directly instead of recomputing it
        m.num_heads = model.config.num_attention_heads
        # Since heads are not pruned, keep the original num_key_value_heads and derive the groups from it
        m.num_key_value_groups = model.config.num_attention_heads // model.config.num_key_value_heads
    elif name.endswith("mlp"):
        if hasattr(m, "gate_proj"):
            m.hidden_size = m.gate_proj.in_features
            model.config.intermediate_size = m.gate_proj.out_features
        elif hasattr(m, "gate_up_proj"):
            m.hidden_size = m.gate_up_proj.in_features
            model.config.intermediate_size = m.gate_up_proj.out_features // 2
        else:
            raise ValueError("Unknown mlp layer")
It ran, but the model is not usable after pruning.
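One hedged variant of that loop, assuming transformers >= 4.48 where Qwen2Attention reads its head counts from the config and only keeps head_dim and num_key_value_groups on the module: derive both counts from the pruned projection widths instead of freezing them at the original config values, so the groups used by repeat_kv/SDPA match the actual tensor shapes (the 16-vs-28 mismatch above is exactly 16 pruned query heads against 4 key/value heads repeated by the stale group count of 7).

```python
# Hedged sketch: recompute the attention bookkeeping from the pruned projection
# widths. Assumes transformers >= 4.48, where Qwen2Attention exposes head_dim and
# num_key_value_groups on the module and reads the head counts from model.config.
for name, m in model.named_modules():
    if name.endswith("self_attn"):
        num_heads = m.q_proj.out_features // m.head_dim      # e.g. 28 -> 16 after head pruning
        num_kv_heads = m.k_proj.out_features // m.head_dim   # e.g. 4, possibly pruned as well
        m.num_key_value_groups = num_heads // num_kv_heads   # used by repeat_kv / SDPA
        # keep the config consistent so the pruned checkpoint reloads with the right shapes
        model.config.num_attention_heads = num_heads
        model.config.num_key_value_heads = num_kv_heads
    elif name.endswith("mlp") and hasattr(m, "gate_proj"):
        model.config.intermediate_size = m.gate_proj.out_features
```

This only helps if the pruner keeps the query-head count a multiple of the key/value-head count; otherwise grouped-query attention cannot be reconstructed.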
Is the model unusable after pruning because the inference code needs to be modified accordingly?
Could anyone share the inference code for the pruned model?
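For what it's worth, here is a minimal inference sketch for a pruned checkpoint, assuming it was written with save_pretrained (e.g. the --save_model path from the first post) and that the saved config already reflects the pruned shapes; apart from that assumption it is just the standard transformers generation flow.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

pruned_path = "/root/autodl-tmp/model/q7bp"   # the --save_model directory; adjust to yours

tokenizer = AutoTokenizer.from_pretrained(pruned_path)
model = AutoModelForCausalLM.from_pretrained(
    pruned_path, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

prompt = "Give me a short introduction to large language model."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```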
The code in the README runs fine for me (Qwen/Qwen2.5-0.5B-Instruct), but the output is not readable.
Original response is: Give me a short introduction to large language model. Certainly! A large language model (LLM) is a type of artificial intelligence that can generate human-like text based on the input it receives. These models typically use deep learning techniques and neural networks to learn from vast amounts of data, allowing them to understand complex patterns, context, and nuances in natural language. They have been used in various applications such as chatbots, virtual assistants, legal research, and more.
After pruning: Give me a short introduction to large language model. 10/i2etices262111124222111121111732115111111111111171111111115021150100011 / 0111 � 100 /11 0111 / t g0 g11 g01111 and1111101101121111111110011111101111111111111101111100111111111111011111110111111101011211110111011011141101001111 g11 g111 g g i g g1 g g g g11 n111101111111111111111111011111111110111011111111110111110111101111011111111111111011111111111014141211101111111104111111110104111111101011111111111111111111111111111011111111111110110011111011111111111014111211111111111111110111111111111111111110111111111111111111111111111
Or like: Give me a short introduction to large language model.
-
-
-
-
- ( - — - - - - - - as - .. - - - - — - - - - - - - # - - - - # [ -4 - — - - - right ( -- of - - - - - — — -- — - - - - - - - - - - - - - - - - - - - — - - - - - -- - - - - - - - - - — - - - - - - -4 -84 - - -4 - - —6 -2 - - - - - -- -2 - - - - - - - - - -14 - -4 - - - - , -4 - -2 - - - 2 - -24 - - 42 6 - - - - 42 2 -4 44 - - 744 - 2 4 - - - - -4 - - 4 - - - 4 4 4 4 -4 -4 4 4 4 - ( 4443 4 44 4 64 44 4 - 49 - 4 43 44 46 -7 - 44444342 - 6 4 4474 4444 7 2 2472 - - 44 47 - - 4 44 4 47 - var4 - 6 4 2 2 -424 2466 4
-
-
-
So I tried modifying prune_llm.py to ignore the embed_tokens layer and set prune_num_heads to False; then the response is empty...
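For reference, a rough sketch of that modification, assuming the tp.pruner.MetaPruner arguments used in examples/LLMs/prune_llm.py (the importance criterion, sequence length, and ratio below are placeholders):

```python
import torch
import torch_pruning as tp

# Hedged sketch: keep the vocabulary-sized layers and the attention-head layout
# intact, so only hidden/intermediate channels are pruned.
example_inputs = torch.randint(0, model.config.vocab_size, (1, 64)).to(model.device)

pruner = tp.pruner.MetaPruner(
    model,
    example_inputs,
    importance=tp.importance.MagnitudeImportance(p=2),          # placeholder criterion
    pruning_ratio=0.3,                                          # placeholder ratio
    ignored_layers=[model.model.embed_tokens, model.lm_head],   # skip embed_tokens / lm_head
    prune_num_heads=False,                                      # keep the original head counts
)
pruner.step()
```

Even when the shapes check out, an empty or garbled response after pruning at this ratio usually means the remaining weights still need some recovery fine-tuning before the output becomes readable again.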