Discussion: Investigate Perf Boosts Through Pruning (DeepSparse)
Just saw this and it seems pretty crazy. I don't know exactly where it belongs, but figured it's worth discussing. They claim significant performance gains and some very aggressive model compression. A lot of the interesting information is right on the README page linked below.
Our MLPerf Inference v3.0 submission contains the following results for the BERT-Large SQuAD v1.1 question answering task:
| Benchmark | Engine | Precision | Compressed File Size | SQuAD v1.1 F1 Score (R = % of baseline accuracy recovered) | Offline Throughput [samples/sec] |
|---|---|---|---|---|---|
| BERT-Large Baseline | ONNXRuntime | FP32 | 1.3 GB | 90.874 (R=100.00%) | 4.60 |
| oBERT-Large 99% | DeepSparse | INT8 | 38.2 MB | 90.03 (R=99.07%) | 1367.14 |
| oBERT-MobileBERT 99.9% | DeepSparse | INT8 | 19.45 MB | 90.80 (R=99.92%) | 3275.62 |
| oBERT-MobileBERT 99% | DeepSparse | INT8 | 9.56 MB | 90.41 (R=99.49%) | 5578.73 |
https://github.com/mlcommons/inference_results_v3.0/blob/main/open/NeuralMagic/README.md
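For anyone who wants to poke at one of these models locally, the `deepsparse` Python package exposes a pipeline API for CPU inference. A rough sketch, assuming you have a SparseZoo stub or a local ONNX export of one of their pruned-quantized BERT models (the `model_path` below is a placeholder, not a real stub):

```python
# pip install deepsparse   (Neural Magic's CPU inference runtime)
from deepsparse import Pipeline

# NOTE: the model_path is illustrative only; substitute a real SparseZoo stub
# or a path to a local ONNX model exported with SparseML.
qa = Pipeline.create(
    task="question-answering",
    model_path="zoo:some/pruned-quantized-bert-squad-stub",  # placeholder
)

result = qa(
    question="Which engine ran the sparse submissions?",
    context="The MLPerf v3.0 open submissions from Neural Magic ran on DeepSparse.",
)
print(result)  # predicted answer span plus a confidence score
```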
From the linked repo:
unstructured gradual pruning, quantization-aware training, and structural distillation
I think the model layout would be quite different and not directly comparable to LLaMA, but it's definitely interesting.
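For anyone unfamiliar with what "unstructured gradual pruning" looks like in practice, here's a minimal sketch in plain PyTorch. It is not Neural Magic's actual recipe (their SparseML recipes handle this for you); the layer size, schedule, and final sparsity below are made up for illustration:

```python
import torch
import torch.nn as nn

def sparsity_at(step, total_steps, final_sparsity=0.9, initial_sparsity=0.0):
    """Cubic sparsity ramp (Zhu & Gupta style): start dense, end at final_sparsity."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - frac) ** 3

def apply_magnitude_mask(linear: nn.Linear, sparsity: float):
    """Zero out the smallest-magnitude weights so roughly `sparsity` of them are zero."""
    with torch.no_grad():
        w = linear.weight
        k = int(sparsity * w.numel())
        if k == 0:
            return torch.ones_like(w, dtype=torch.bool)
        threshold = w.abs().flatten().kthvalue(k).values
        mask = w.abs() > threshold
        w.mul_(mask)
        return mask

# Toy training loop: increase sparsity every `prune_every` steps and
# re-apply the mask after each optimizer step so pruned weights stay zero.
model = nn.Linear(768, 768)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
total_steps, prune_every = 1000, 100
mask = torch.ones_like(model.weight, dtype=torch.bool)

for step in range(total_steps):
    x = torch.randn(32, 768)
    loss = model(x).pow(2).mean()          # dummy objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % prune_every == 0:
        mask = apply_magnitude_mask(model, sparsity_at(step, total_steps))
    with torch.no_grad():
        model.weight.mul_(mask)            # keep pruned weights at zero
```

Quantization-aware training and distillation would be layered on top of this; the point is just that sparsity is ramped up gradually during training rather than applied once at the end.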
This may be interesting: https://github.com/horseee/LLaMA-Pruning
Pruning: The following script globally removes 50% of the dimensions of the LLaMA-7B model, resulting in a lightweight model with 1.72B parameters.
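To see why removing 50% of the dimensions shrinks 7B down to ~1.7B: every weight matrix sitting between two pruned dimensions loses roughly three quarters of its entries, and 7B / 4 ≈ 1.75B, which lines up with the quoted 1.72B. A toy illustration in plain PyTorch (this is not the linked script, and the sizes are just LLaMA-like placeholders):

```python
import torch.nn as nn

def param_count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# Toy MLP block with dimensions we can "prune".
hidden, ffn = 4096, 11008
full = nn.Sequential(nn.Linear(hidden, ffn), nn.Linear(ffn, hidden))

# Structured pruning: keep only half of every dimension.
pruned = nn.Sequential(nn.Linear(hidden // 2, ffn // 2), nn.Linear(ffn // 2, hidden // 2))

ratio = param_count(pruned) / param_count(full)
print(param_count(full), param_count(pruned), ratio)
# ratio is roughly 0.25: halving both sides of a weight matrix quarters its size
```

Whether a model pruned that aggressively without retraining keeps useful accuracy is a separate question; the repo treats the pruning step as a starting point, not a finished model.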
This issue was closed because it has been inactive for 14 days since being marked as stale.