The accuracy problem
Hello, thank you for providing such a useful method, but I encountered some problems while pruning llama-7b.
Environment: Python 3.10, torch 2.6.0, transformers 4.49.0, accelerate 1.5.2
My command: python prune_llm.py --model huggyllama/llama-7b --pruning_ratio 0.5 --save_model "/mnt/8tb_raid/david_model/Torch-Pruning/examples/LLMs/out/"
and I successfully pruned the model:
----------------- After Pruning -----------------
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 2048)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=2048, bias=False)
(v_proj): Linear(in_features=2048, out_features=2048, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=5504, bias=False)
(up_proj): Linear(in_features=2048, out_features=5504, bias=False)
(down_proj): Linear(in_features=5504, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=32000, bias=False)
)
LlamaConfig {
"_attn_implementation_autoset": true,
"_name_or_path": "/mnt/8tb_raid/david_model/Torch-Pruning/examples/LLMs/meta-llama/",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 5504,
"max_position_embeddings": 4096,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 16,
"num_hidden_layers": 32,
"num_key_value_heads": 16,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.49.0",
"use_cache": false,
"vocab_size": 32000
}
num_params 1750206464
evaluating on wikitext2
nsamples 83
sample 0
sample 50
wikitext perplexity 11306.8095703125
but when I use the model, I get very low accuracy:
I tried using LoRA to fine-tune the model, but the output is still very poor. Are there any ways to improve this?
Thank you!!
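For reference, a minimal sketch of the kind of LoRA recovery setup involved (using peft; the hyperparameters and the checkpoint path are illustrative, not the exact values from my run):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# Load the pruned checkpoint saved by prune_llm.py (path is illustrative).
pruned_dir = "/mnt/8tb_raid/david_model/Torch-Pruning/examples/LLMs/out/"
model = AutoModelForCausalLM.from_pretrained(pruned_dir, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(pruned_dir)

# Attach LoRA adapters to the attention projections; rank/alpha/dropout
# here are just example values.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ... then run the usual Trainer / SFT loop on a recovery dataset.
```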
@VainF Do you have any suggestions? Thank you!! I think the model has already been severely damaged after pruning, so fine-tuning may not be very effective.
I have the same problem. I think we should try not pruning the first 3-5 layers and the last 3-5 layers. I'm trying... (rough sketch below)
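Something like this is what I mean: build an ignored_layers list so the pruner leaves the first and last few decoder blocks untouched (module names follow the LLaMA structure printed above; how this plugs into prune_llm.py's pruner setup is an assumption on my side):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16
)

num_layers = len(model.model.layers)  # 32 for llama-7b
skip = set(range(4)) | set(range(num_layers - 4, num_layers))

# Never prune the LM head; also keep every Linear inside the first/last
# 4 decoder blocks out of the pruning scope.
ignored_layers = [model.lm_head]
for i, block in enumerate(model.model.layers):
    if i in skip:
        ignored_layers += [m for m in block.modules() if isinstance(m, nn.Linear)]

# ignored_layers would then be passed to the torch-pruning pruner built in
# prune_llm.py (its ignored_layers argument).
```

Whether the resulting per-layer shapes can still be saved in the HF format is exactly the concern raised below.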
@Cyber-Vadok I think we will have a size mismatch problem when loading the model after pruning, but we can try!
You are right! In LLMPruner there's the "root_instances" argument and it works fine, pruning only the flagged layers. I opened an issue just a minute ago asking if there's a way to get the same behaviour with this version of torch-pruning... #473
Huggingface transformers only supports uniform structures. So bottleneck pruning might be incompatible - we can’t save the model in the HF format. But it’s ok to save the object with torch.save(llm, PATH.pt). If you would like to do bottleneck pruning, a simple solution is to include the embedding layer in the ignored_layers, since all embedding_dims are coupled.
An example is available here.
BTW, it's a good idea to add something like root_modules. Will update the pruner in the next version.
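A minimal sketch of those two points (variable names and paths are illustrative, not taken from the repo):

```python
import torch
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16
)

# 1) Keep all embedding dims intact during bottleneck pruning by ignoring
#    the embedding layer (and the coupled lm_head).
ignored_layers = [llm.model.embed_tokens, llm.lm_head]
# ... pass ignored_layers to the pruner and run pruner.step() as in prune_llm.py ...

# 2) After bottleneck pruning, skip save_pretrained() and pickle the whole
#    module instead, since the HF config cannot describe per-layer shapes.
torch.save(llm, "pruned_llama.pt")

# Reload later; torch >= 2.6 defaults to weights_only=True, so disable it
# to load a full pickled module.
llm = torch.load("pruned_llama.pt", weights_only=False)
```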