Discussion: Investigate Perf Boosts Through Pruning (DeepSparse)
Just saw this and it seems pretty crazy. I don't know exactly where it belongs, but figured it's worth discussing. They claim significant performance gains and some very aggressive model compression. A lot of the interesting information is right on the README page linked below.
Our MLPerf Inference v3.0 submission contains the following results for the BERT-Large SQuAD v1.1 question answering task:
| Benchmark | Engine | Precision | Compressed File Size | SQuAD v1.1 F1 Score (R = % of baseline accuracy recovered) | Offline Throughput [samples/sec] |
|---|---|---|---|---|---|
| BERT-Large Baseline | ONNXRuntime | FP32 | 1.3 GB | 90.874 (R=100.00%) | 4.60 |
| oBERT-Large 99% | DeepSparse | INT8 | 38.2 MB | 90.03 (R=99.07%) | 1367.14 |
| oBERT-MobileBERT 99.9% | DeepSparse | INT8 | 19.45 MB | 90.80 (R=99.92%) | 3275.62 |
| oBERT-MobileBERT 99% | DeepSparse | INT8 | 9.56 MB | 90.41 (R=99.49%) | 5578.73 |
https://github.com/mlcommons/inference_results_v3.0/blob/main/open/NeuralMagic/README.md
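For anyone who wants to poke at one of these models locally, the `deepsparse` Python package exposes a pipeline API for CPU inference. A rough sketch, assuming you have a SparseZoo stub or a local ONNX export of one of their pruned-quantized BERT models (the `model_path` below is a placeholder, not a real stub):

```python
# pip install deepsparse   (Neural Magic's CPU inference runtime)
from deepsparse import Pipeline

# NOTE: the model_path is illustrative only; substitute a real SparseZoo stub
# or a path to a local ONNX model exported with SparseML.
qa = Pipeline.create(
    task="question-answering",
    model_path="zoo:some/pruned-quantized-bert-squad-stub",  # placeholder
)

result = qa(
    question="Which engine ran the sparse submissions?",
    context="The MLPerf v3.0 open submissions from Neural Magic ran on DeepSparse.",
)
print(result)  # predicted answer span plus a confidence score
```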
From the linked repo:
unstructured gradual pruning, quantization-aware training, and structural distillation
I think the model layout would be quite different and not directly comparable to LLaMA, but it's definitely interesting.
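For anyone unfamiliar with what "unstructured gradual pruning" looks like in practice, here's a minimal sketch in plain PyTorch. It is not Neural Magic's actual recipe (their SparseML recipes handle this for you); the layer size, schedule, and final sparsity below are made up for illustration:

```python
import torch
import torch.nn as nn

def sparsity_at(step, total_steps, final_sparsity=0.9, initial_sparsity=0.0):
    """Cubic sparsity ramp (Zhu & Gupta style): start dense, end at final_sparsity."""
    frac = min(step / total_steps, 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - frac) ** 3

def apply_magnitude_mask(linear: nn.Linear, sparsity: float):
    """Zero out the smallest-magnitude weights so roughly `sparsity` of them are zero."""
    with torch.no_grad():
        w = linear.weight
        k = int(sparsity * w.numel())
        if k == 0:
            return torch.ones_like(w, dtype=torch.bool)
        threshold = w.abs().flatten().kthvalue(k).values
        mask = w.abs() > threshold
        w.mul_(mask)
        return mask

# Toy training loop: increase sparsity every `prune_every` steps and
# re-apply the mask after each optimizer step so pruned weights stay zero.
model = nn.Linear(768, 768)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
total_steps, prune_every = 1000, 100
mask = torch.ones_like(model.weight, dtype=torch.bool)

for step in range(total_steps):
    x = torch.randn(32, 768)
    loss = model(x).pow(2).mean()          # dummy objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % prune_every == 0:
        mask = apply_magnitude_mask(model, sparsity_at(step, total_steps))
    with torch.no_grad():
        model.weight.mul_(mask)            # keep pruned weights at zero
```

Quantization-aware training and distillation would be layered on top of this; the point is just that sparsity is ramped up gradually during training rather than applied once at the end.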
This may be interesting: https://github.com/horseee/LLaMA-Pruning
Pruning: The following script globally removes 50% of the dimensions of the LLaMA-7B model, resulting in a lightweight model with 1.72B parameters.
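To see why removing 50% of the dimensions shrinks 7B down to ~1.7B: every weight matrix sitting between two pruned dimensions loses roughly three quarters of its entries, and 7B / 4 ≈ 1.75B, which lines up with the quoted 1.72B. A toy illustration in plain PyTorch (this is not the linked script, and the sizes are just LLaMA-like placeholders):

```python
import torch.nn as nn

def param_count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# Toy MLP block with dimensions we can "prune".
hidden, ffn = 4096, 11008
full = nn.Sequential(nn.Linear(hidden, ffn), nn.Linear(ffn, hidden))

# Structured pruning: keep only half of every dimension.
pruned = nn.Sequential(nn.Linear(hidden // 2, ffn // 2), nn.Linear(ffn // 2, hidden // 2))

ratio = param_count(pruned) / param_count(full)
print(param_count(full), param_count(pruned), ratio)
# ratio is roughly 0.25: halving both sides of a weight matrix quarters its size
```

Whether a model pruned that aggressively without retraining keeps useful accuracy is a separate question; the repo treats the pruning step as a starting point, not a finished model.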
This issue was closed because it has been inactive for 14 days since being marked as stale.