The accuracy problem
Hello, thank you for providing such a useful method, but I encountered some problems while pruning llama-7b.
Environment: Python 3.10, torch 2.6.0, transformers 4.49.0, accelerate 1.5.2
My command: python prune_llm.py --model huggyllama/llama-7b --pruning_ratio 0.5 --save_model "/mnt/8tb_raid/david_model/Torch-Pruning/examples/LLMs/out/"
and I successfully pruned the model:
----------------- After Pruning -----------------
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 2048)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=2048, bias=False)
(v_proj): Linear(in_features=2048, out_features=2048, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=5504, bias=False)
(up_proj): Linear(in_features=2048, out_features=5504, bias=False)
(down_proj): Linear(in_features=5504, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
)
)
(norm): LlamaRMSNorm((2048,), eps=1e-05)
(rotary_emb): LlamaRotaryEmbedding()
)
(lm_head): Linear(in_features=2048, out_features=32000, bias=False)
)
LlamaConfig {
"_attn_implementation_autoset": true,
"_name_or_path": "/mnt/8tb_raid/david_model/Torch-Pruning/examples/LLMs/meta-llama/",
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 2048,
"initializer_range": 0.02,
"intermediate_size": 5504,
"max_position_embeddings": 4096,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 16,
"num_hidden_layers": 32,
"num_key_value_heads": 16,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 10000.0,
"tie_word_embeddings": false,
"torch_dtype": "float16",
"transformers_version": "4.49.0",
"use_cache": false,
"vocab_size": 32000
}
num_params 1750206464
evaluating on wikitext2
nsamples 83
sample 0
sample 50
wikitext perplexity 11306.8095703125
but when I use the model, I get very low accuracy:
I tried using LoRA to fine-tune the model, but the output is still very poor. Are there any ways to improve this?
Thank you!!
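For reference, a minimal sketch of the kind of LoRA recovery setup involved (using peft; the hyperparameters and the checkpoint path are illustrative, not the exact values from my run):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# Load the pruned checkpoint saved by prune_llm.py (path is illustrative).
pruned_dir = "/mnt/8tb_raid/david_model/Torch-Pruning/examples/LLMs/out/"
model = AutoModelForCausalLM.from_pretrained(pruned_dir, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(pruned_dir)

# Attach LoRA adapters to the attention projections; rank/alpha/dropout
# here are just example values.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ... then run the usual Trainer / SFT loop on a recovery dataset.
```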
@VainF Do you have any suggestions? Thank you!! I think the model has already been severely damaged after pruning, so fine-tuning may not be very effective.
I have the same problem. I think we should try not pruning the first 3-5 layers and the last 3-5 layers. I'm trying... (rough sketch below)
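Something like this is what I mean: build an ignored_layers list so the pruner leaves the first and last few decoder blocks untouched (module names follow the LLaMA structure printed above; how this plugs into prune_llm.py's pruner setup is an assumption on my side):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16
)

num_layers = len(model.model.layers)  # 32 for llama-7b
skip = set(range(4)) | set(range(num_layers - 4, num_layers))

# Never prune the LM head; also keep every Linear inside the first/last
# 4 decoder blocks out of the pruning scope.
ignored_layers = [model.lm_head]
for i, block in enumerate(model.model.layers):
    if i in skip:
        ignored_layers += [m for m in block.modules() if isinstance(m, nn.Linear)]

# ignored_layers would then be passed to the torch-pruning pruner built in
# prune_llm.py (its ignored_layers argument).
```

Whether the resulting per-layer shapes can still be saved in the HF format is exactly the concern raised below.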
@Cyber-Vadok I think we will have a size mismatch problem when loading the model after pruning, but we can try!
You are right! In LLMPruner there's the "root_instances" argument and it works fine, pruning only the flagged layers. I opened an issue just a minute ago asking if there's a way to get the same behaviour with this version of torch-pruning... #473
Huggingface transformers only supports uniform structures. So bottleneck pruning might be incompatible - we can’t save the model in the HF format. But it’s ok to save the object with torch.save(llm, PATH.pt). If you would like to do bottleneck pruning, a simple solution is to include the embedding layer in the ignored_layers, since all embedding_dims are coupled.
An example is available here.
BTW, it's a good idea to add something like root_modules. Will update the pruner in the next version.
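A minimal sketch of those two points (variable names and paths are illustrative, not taken from the repo):

```python
import torch
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", torch_dtype=torch.float16
)

# 1) Keep all embedding dims intact during bottleneck pruning by ignoring
#    the embedding layer (and the coupled lm_head).
ignored_layers = [llm.model.embed_tokens, llm.lm_head]
# ... pass ignored_layers to the pruner and run pruner.step() as in prune_llm.py ...

# 2) After bottleneck pruning, skip save_pretrained() and pickle the whole
#    module instead, since the HF config cannot describe per-layer shapes.
torch.save(llm, "pruned_llama.pt")

# Reload later; torch >= 2.6 defaults to weights_only=True, so disable it
# to load a full pickled module.
llm = torch.load("pruned_llama.pt", weights_only=False)
```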