
Question about LLM inference performance

Open · davidray222 opened this issue 9 months ago · 0 comments

Thank you for providing such outstanding research!

I tested the LLaMA-7B model, and after pruning, neither the memory usage nor the inference speed differs significantly from the original model. May I ask whether you suggest any methods to accelerate inference for pruned models?
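
For reference, here is a minimal sketch of how I verified that the checkpoint is actually sparse (the checkpoint path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# "path/to/pruned-llama-7b" is a placeholder for the Wanda-pruned checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/pruned-llama-7b", torch_dtype=torch.float16
)

total, zeros = 0, 0
for name, param in model.named_parameters():
    # Count only 2-D weight matrices, skipping the embedding table.
    if param.dim() == 2 and "embed" not in name:
        total += param.numel()
        zeros += (param == 0).sum().item()
print(f"overall weight sparsity: {zeros / total:.2%}")
```

As far as I can tell, the zeroed weights are still stored in ordinary dense fp16 tensors, which would explain why the memory footprint does not shrink.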

GPU: NVIDIA A6000
torch 2.2.0
transformers 4.31.0
accelerate 0.21.0
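
For completeness, this is roughly the measurement setup I used on the A6000 (checkpoint path and prompt are placeholders):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint path and prompt; substitute the real ones.
path = "path/to/pruned-llama-7b"
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.float16
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(path)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

with torch.no_grad():
    model.generate(**inputs, max_new_tokens=32)   # warm-up pass
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    out = model.generate(**inputs, max_new_tokens=128)
    torch.cuda.synchronize()

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"tokens/s: {new_tokens / (time.time() - start):.1f}")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```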

davidray222 · Mar 20 '25, 19:03