
Discussion: Investigate Perf Boosts Through Pruning (DeepSparse)

MillionthOdin16 opened this issue on Apr 13, 2023 · 2 comments

Just saw this and it seems pretty crazy. I'm not sure exactly where to put it, but figured it's worth discussing. They claim significant performance gains and impressive model compression. A lot of the interesting information is right on the README page that I linked.

Neural Magic Repo Link

Our MLPerf Inference v3.0 submission contains the following results for the BERT-Large SQuAD v1.1 question answering task:

| Benchmark | Engine | Precision | Compressed File Size | SQuAD v1.1 F1 Score (R = % of Base Accuracy) | Offline Throughput [samples/sec] |
|---|---|---|---|---|---|
| BERT-Large Baseline | ONNXRuntime | FP32 | 1.3 GB | 90.874 (R = 100.00%) | 4.60 |
| oBERT-Large 99% | DeepSparse | INT8 | 38.2 MB | 90.03 (R = 99.07%) | 1367.14 |
| oBERT-MobileBERT 99.9% | DeepSparse | INT8 | 19.45 MB | 90.80 (R = 99.92%) | 3275.62 |
| oBERT-MobileBERT 99% | DeepSparse | INT8 | 9.56 MB | 90.41 (R = 99.49%) | 5578.73 |

https://github.com/mlcommons/inference_results_v3.0/blob/main/open/NeuralMagic/README.md
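For context, running one of these sparse INT8 models looks roughly like the sketch below, which uses DeepSparse's Python `Pipeline` API. The SparseZoo model stub is hypothetical, and the exact task string and output fields may differ across DeepSparse versions, so treat this as an illustration rather than a verified recipe.

```python
# Minimal sketch: serving a sparse-quantized BERT QA model with DeepSparse on CPU.
from deepsparse import Pipeline

qa_pipeline = Pipeline.create(
    task="question-answering",
    # Hypothetical SparseZoo stub; look up the real identifier in Neural Magic's docs.
    model_path="zoo:nlp/question_answering/obert-large/pruned_quant-example",
)

result = qa_pipeline(
    question="What engine runs the sparse INT8 model?",
    context="Neural Magic's DeepSparse engine executes pruned, quantized "
            "transformers efficiently on CPUs.",
)
print(result.answer)
```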

MillionthOdin16 · Apr 13, 2023

From the linked repo:

unstructured gradual pruning, quantization-aware training, and structural distillation

I think the model layout would be very different and, further, not directly comparable to LLaMA. But definitely interesting.
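For readers unfamiliar with the first of those terms, the toy example below illustrates what unstructured gradual pruning means in practice, using PyTorch's built-in `torch.nn.utils.prune` utilities. It is not the oBERT recipe from the linked repo, just the general idea of raising sparsity in steps with fine-tuning in between.

```python
# Illustration only: gradual unstructured magnitude pruning of a single linear layer.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(1024, 1024)

# Raise sparsity over several steps instead of pruning everything at once.
for target_sparsity in (0.3, 0.6, 0.9):
    # Zero out the lowest-magnitude weights until `target_sparsity` of them are zero.
    prune.l1_unstructured(layer, name="weight", amount=target_sparsity)
    prune.remove(layer, "weight")  # bake the mask into the weight tensor
    # ...fine-tune the model here to recover accuracy before the next step...

sparsity = (layer.weight == 0).float().mean().item()
print(f"final weight sparsity: {sparsity:.2%}")
```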

jon-chuang · Apr 13, 2023

This may be interesting: https://github.com/horseee/LLaMA-Pruning

Pruning: The following script globally removes 50% of the dimensions of the LLaMA-7B model, resulting in a lightweight model with 1.72B parameters.
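A back-of-the-envelope check (approximate figures, not the exact LLaMA-7B configuration) shows why halving the hidden dimensions lands near the reported 1.72B parameters:

```python
# Most LLaMA parameters sit in d x d (attention) and d x ~4d (MLP) weight matrices,
# so cutting the hidden dimension in half cuts parameters roughly 4x.
full_params = 6.7e9      # LLaMA-7B parameter count (approximate)
dim_keep_ratio = 0.5     # "removes 50% of the dimensions"

# Matrix parameters scale with the product of input and output widths.
pruned_params = full_params * dim_keep_ratio ** 2

print(f"estimated pruned size: {pruned_params / 1e9:.2f}B parameters")
# ~1.7B, in line with the 1.72B reported by LLaMA-Pruning.
```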

slaren · Apr 15, 2023

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · Apr 11, 2024