Pruner-Zero
The limit of pruning rate
What is the maximum compression ratio that this article can achieve? Can it compress a 65B model to a size of 7B while maintaining the performance of the 7B model?
Thank you for your interest. In most LLM pruning studies, 25% or 50% sparsity is the norm; for reference, Tables 2 & 3 in our paper report experiments on 65 B and 70 B models at exactly these sparsity levels.
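As a rough illustration of what 50% unstructured sparsity means in practice, here is a minimal sketch that zeros the lowest-scoring half of a weight matrix. Plain magnitude is used as a stand-in score here; it is not the metric this paper searches for, and the function name is mine:

```python
import numpy as np

def prune_to_sparsity(weight, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with the lowest scores.
    Magnitude |w| is a placeholder importance score for illustration."""
    scores = np.abs(weight)
    k = int(weight.size * sparsity)
    # k-th smallest score serves as the pruning threshold
    threshold = np.partition(scores.ravel(), k)[k]
    mask = scores >= threshold
    return weight * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128))
pruned = prune_to_sparsity(w, 0.5)
print(f"achieved sparsity: {np.mean(pruned == 0):.2f}")
```

With continuous random weights there are no ties at the threshold, so the achieved sparsity matches the requested fraction almost exactly.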
Compressing a 65 B model down to 7 B with pruning alone is impractical. A practical pipeline is:
- Structured-prune the 70 B model to 25 % sparsity, leaving ≈ 52.5 B parameters.
- Quantize the remaining weights from 32-bit to 4-bit (8× smaller storage). The parameter count stays at ≈ 52.5 B, but the memory footprint matches that of ≈ 6.56 B parameters stored in FP32, roughly the size of a 7 B model.
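The arithmetic behind these two steps can be checked in a few lines (the numbers come from the pipeline above; the helper name is mine):

```python
def memory_gb(params, bits):
    """Memory footprint in gigabytes for `params` weights at `bits` bits each."""
    return params * bits / 8 / 1e9

# Step 1: structured pruning removes 25% of the 70B weights.
pruned = 70e9 * (1 - 0.25)           # ≈ 52.5e9 parameters remain

# Step 2: 4-bit quantization shrinks storage 8x relative to 32-bit,
# so the footprint equals that of ~6.56B FP32 parameters.
fp32_equivalent = pruned * 4 / 32

print(f"{pruned / 1e9:.2f}B params after pruning")
print(f"{fp32_equivalent / 1e9:.2f}B FP32-equivalent params after quantization")
print(f"{memory_gb(pruned, 4):.1f} GB at 4-bit vs "
      f"{memory_gb(7e9, 32):.1f} GB for a 7B FP32 model")
```

The comparison in the last line shows why the combination lands near a 7 B-class footprint: 52.5 B weights at 4 bits occupy about 26 GB, close to the 28 GB of a 7 B model in FP32.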