Question about LLM inference performance after pruning
Thank you for this outstanding research!
I tested the LLaMA-7B model, and after pruning, neither the memory usage nor the inference speed differs significantly from the original dense model. Could you point me to any methods for accelerating inference with the pruned models?
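For reference, this is roughly how I measured it; a minimal sketch of my comparison, where the checkpoint path is a placeholder rather than my actual script:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/pruned-llama-7b"  # placeholder: dense or Wanda-pruned checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.float16
).cuda()
model.eval()

inputs = tokenizer("The meaning of life is", return_tensors="pt").to("cuda")

# Warm-up pass so one-time CUDA initialization does not skew the timing.
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=8)

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.time() - start

print(f"latency: {elapsed:.2f}s for 128 new tokens")
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```

I ran the same measurement against the dense baseline and the pruned checkpoint and got nearly identical latency and peak memory.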
Environment:
- GPU: NVIDIA A6000
- torch 2.2.0
- transformers 4.31.0
- accelerate 0.21.0
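For context on what I am asking: as far as I understand, the pruned checkpoint still stores its zeros in dense tensors, so the usual dense GPU kernels do the same amount of work as before. The kind of acceleration I have in mind is something like PyTorch's prototype 2:4 semi-structured sparsity (torch >= 2.1, fp16, CUDA); a hedched sketch on a toy layer, not code from this repo, with illustrative shapes:

```python
import torch
from torch.sparse import SparseSemiStructuredTensor, to_sparse_semi_structured

# Use the CUTLASS backend, as in the PyTorch prototype tutorial.
SparseSemiStructuredTensor._FORCE_CUTLASS = True

torch.manual_seed(0)
linear = torch.nn.Linear(4096, 4096, bias=False).half().cuda()

# Impose a 2:4 pattern: in every group of 4 weights, zero the 2 with the
# smallest magnitude, keeping exactly 2 nonzeros per group.
w = linear.weight.detach()
groups = w.reshape(-1, 4)
idx = groups.abs().argsort(dim=1)[:, :2]            # 2 smallest per group
mask = torch.ones_like(groups).scatter_(1, idx, 0)  # zero those positions
linear.weight = torch.nn.Parameter((groups * mask).reshape_as(w))

x = torch.randn(64, 4096, dtype=torch.float16, device="cuda")
dense_out = linear(x)

# Swap the dense weight for a compressed semi-structured representation;
# matmuls then dispatch to sparse kernels where the GPU supports them.
linear.weight = torch.nn.Parameter(to_sparse_semi_structured(linear.weight))
sparse_out = linear(x)

print(torch.allclose(dense_out, sparse_out, atol=1e-2, rtol=1e-2))
```

Is something along these lines the intended way to realize speedups from the 2:4 pruned models, or is there another recommended path?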