Pruner-Zero
The limit of pruning rate
What is the maximum compression ratio that this article can achieve? Can it compress a 65B model to a size of 7B while maintaining the performance of the 7B model?
Thank you for your interest. In most LLM pruning studies, 25% or 50% sparsity is the norm; for reference, Tables 2 & 3 in our paper report experiments on 65 B and 70 B models at exactly these sparsity levels.
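As a rough illustration of what 50% unstructured sparsity means in practice, here is a minimal sketch that zeros the lowest-scoring half of a weight matrix. Plain magnitude is used as a stand-in score here; it is not the metric this paper searches for, and the function name is mine:

```python
import numpy as np

def prune_to_sparsity(weight, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with the lowest scores.
    Magnitude |w| is a placeholder importance score for illustration."""
    scores = np.abs(weight)
    k = int(weight.size * sparsity)
    # k-th smallest score serves as the pruning threshold
    threshold = np.partition(scores.ravel(), k)[k]
    mask = scores >= threshold
    return weight * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 128))
pruned = prune_to_sparsity(w, 0.5)
print(f"achieved sparsity: {np.mean(pruned == 0):.2f}")
```

With continuous random weights there are no ties at the threshold, so the achieved sparsity matches the requested fraction almost exactly.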
Compressing a 65 B model down to 7 B with pruning alone is impractical. A practical pipeline is:
- Structured-prune the 70 B model to 25 % sparsity, leaving ≈ 52.5 B parameters.
- Quantize the remaining weights from 32-bit to 4-bit (8× smaller storage). The parameter count stays at ≈ 52.5 B, but the memory footprint matches that of ≈ 6.56 B parameters stored in FP32, roughly the size of a 7 B model.
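The arithmetic behind these two steps can be checked in a few lines (the numbers come from the pipeline above; the helper name is mine):

```python
def memory_gb(params, bits):
    """Memory footprint in gigabytes for `params` weights at `bits` bits each."""
    return params * bits / 8 / 1e9

# Step 1: structured pruning removes 25% of the 70B weights.
pruned = 70e9 * (1 - 0.25)           # ≈ 52.5e9 parameters remain

# Step 2: 4-bit quantization shrinks storage 8x relative to 32-bit,
# so the footprint equals that of ~6.56B FP32 parameters.
fp32_equivalent = pruned * 4 / 32

print(f"{pruned / 1e9:.2f}B params after pruning")
print(f"{fp32_equivalent / 1e9:.2f}B FP32-equivalent params after quantization")
print(f"{memory_gb(pruned, 4):.1f} GB at 4-bit vs "
      f"{memory_gb(7e9, 32):.1f} GB for a 7B FP32 model")
```

The comparison in the last line shows why the combination lands near a 7 B-class footprint: 52.5 B weights at 4 bits occupy about 26 GB, close to the 28 GB of a 7 B model in FP32.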