llm-compression-benchmark
LLM Compression Benchmark
Made in Vancouver, Canada by Picovoice
This repository is a minimalist and extensible framework for benchmarking LLM compression algorithms.
Table of Contents
- Algorithms
  - GPTQ
  - picoLLM Compression
- Tasks
  - MMLU Score
  - ARC Score
  - Perplexity Loss
- Data
  - MMLU
  - ARC
  - Perplexity (C4)
  - Quantization (C4)
- Models
- Usage
- Results
  - MMLU
  - ARC Easy
  - ARC Challenge
  - Perplexity
Algorithms
GPTQ
GPTQ is arguably the most popular quantization algorithm for LLMs. GPTQ fully reconstructs weights so that the quantized version closely mimics the full-precision one.
picoLLM Compression
picoLLM Compression is Picovoice's in-house LLM compression algorithm. Given a target size, picoLLM optimally distributes the available bits within and across an LLM's weights.
Tasks
MMLU Score
MMLU (Massive Multitask Language Understanding) is a multiple-choice dataset that can measure the models' ability to understand natural language.
ARC Score
ARC (AI2 Reasoning Challenge) is a multiple-choice dataset that measures the models' reasoning ability. The ARC dataset has two partitions: Easy and Challenge. We perform the benchmark on both partitions and report the results separately.
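Because MMLU and ARC are both multiple-choice, a score can be read as the percentage of questions answered correctly. The following is a minimal sketch of that computation, assuming each model answer has already been reduced to a single choice label. `multiple_choice_score` is a hypothetical helper for illustration and is not part of this repository.

```python
def multiple_choice_score(predictions, answers):
    """Return the percentage of questions whose predicted label matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count exact label matches, e.g. "A" == "A".
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Three of the four hypothetical predictions match the answer key.
print(multiple_choice_score(["A", "C", "B", "D"], ["A", "B", "B", "D"]))
```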
Perplexity Loss
Perplexity measures the models' language modeling capabilities.
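Lower perplexity is better. As a rough illustration, perplexity is commonly defined as the exponential of the negative mean log-probability the model assigns to each token. The sketch below assumes the per-token natural-log probabilities are already available; `perplexity` is a hypothetical helper and may differ from how this repository computes the metric.

```python
import math


def perplexity(token_log_probs):
    """Exponential of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("at least one token is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token has a perplexity of about 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```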
Data
The '/res' folder contains all required data for the benchmark. To reproduce it, follow the sections below.
MMLU
Download the MMLU dataset and run the following from the repository's root to extract and format it:
python3 data/mmlu.py --dataset-folder ${DATASET_FOLDER}
ARC
Download the ARC dataset and run the following from the repository's root to extract and format the Challenge portion:
python3 data/arc.py --dataset-folder ${DATASET_FOLDER}
Repeat the step for the Easy portion:
python3 data/arc.py --dataset-folder ${DATASET_FOLDER} --easy
Perplexity (C4)
For the perplexity measurement, we use 128 randomly selected text snippets from the validation portion of the C4 dataset. Once you download the dataset, run the following from the root of the repository to extract and normalize the data:
python3 data/c4-normalize.py \
--repository-folder ${REPOSITORY_FOLDER} \
--normalized-folder ${VALIDATION_FOLDER} \
--portion validation
Replace ${REPOSITORY_FOLDER} with the path to the downloaded dataset repository and ${VALIDATION_FOLDER} with a folder to hold the normalized data.
Then we sample 128 sequences from the normalized data:
python3 data/c4-sample.py \
--dataset-folder ${VALIDATION_FOLDER} \
--portion valid
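The internals of the sampling script are not described here. Purely as an illustration, drawing a fixed-size random subset in a reproducible way might look like the sketch below. The function name, the seed, and the in-memory list of snippets are all assumptions and do not reflect what `data/c4-sample.py` actually does.

```python
import random


def sample_snippets(snippets, k=128, seed=0):
    """Draw k distinct snippets without replacement using a seeded generator."""
    if k > len(snippets):
        raise ValueError("not enough snippets to sample from")
    # A dedicated Random instance keeps the draw reproducible and isolated
    # from the global random state.
    return random.Random(seed).sample(snippets, k)


# Hypothetical stand-in for the normalized text snippets.
corpus = [f"snippet {i}" for i in range(1000)]
subset = sample_snippets(corpus)
print(len(subset))
```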
Quantization (C4)
We need a sample dataset for quantization algorithms (GPTQ, picoLLM). We use 128 randomly selected text snippets from the train portion of the C4 dataset. Once you download the dataset, run the following from the root of the repository to extract and normalize the data:
python3 data/c4-normalize.py \
--repository-folder ${REPOSITORY_FOLDER} \
--normalized-folder ${TRAIN_FOLDER} \
--portion train
Replace ${REPOSITORY_FOLDER} with the path to the downloaded dataset repository and ${TRAIN_FOLDER} with a folder to hold the normalized data.
Then we sample 128 sequences from the normalized data:
python3 data/c4-sample.py \
--dataset-folder ${TRAIN_FOLDER} \
--portion train
Models
We use six models:
- Gemma-2b
- Gemma-7b
- Llama-2-7b
- Llama-3-8b
- Mistral-7b-v0.1
- Phi-2
The corresponding picoLLM compressed models are on Picovoice Console. We create GPTQ models using the package AutoGPTQ. You can quantize the models by running the following:
python3 model/autogptq.py \
--model-uri ${MODEL_URI} \
--quantized-model-folder ${QUANTIZED_MODEL_FOLDER} \
--bits ${BITS}
Usage
To measure the MMLU score for a given model, run the following:
python3 mmlu.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
Replace ${COMPRESSION} with the model's compression: NONE for full-precision models, GPTQ, or picoLLM.
To measure the ARC score for a given model, run the following:
python3 arc.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
Replace ${COMPRESSION} with the model's compression: NONE for full-precision models, GPTQ, or picoLLM.
To measure the perplexity for a given model, run the following:
python3 perplexity.py \
--compression ${COMPRESSION} \
--model-uri ${MODEL_URI}
Replace ${COMPRESSION} with the model's compression: NONE for full-precision models, GPTQ, or picoLLM.
When running picoLLM Compressed models, you must also provide your Picovoice AccessKey, which is available on Picovoice Console.
... --picollm-access-key ${PICOLLM_ACCESS_KEY}
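When running several benchmarks in a row, it can help to assemble these commands programmatically. The sketch below only builds the argument list from the flags shown above. `benchmark_command` is a hypothetical helper, the model path is a placeholder, and the accepted compression spellings are taken from the text above, so check them against the scripts before relying on this.

```python
import shlex


def benchmark_command(script, compression, model_uri, access_key=None):
    """Build the argument list for one of mmlu.py, arc.py, or perplexity.py."""
    if compression not in ("NONE", "GPTQ", "picoLLM"):
        raise ValueError(f"unknown compression: {compression}")
    args = ["python3", script, "--compression", compression, "--model-uri", model_uri]
    if compression == "picoLLM":
        # picoLLM Compressed models additionally need a Picovoice AccessKey.
        if not access_key:
            raise ValueError("picoLLM models require a Picovoice AccessKey")
        args += ["--picollm-access-key", access_key]
    return args


# Placeholder model path; substitute your own ${MODEL_URI}.
cmd = benchmark_command("mmlu.py", "GPTQ", "/path/to/quantized-model")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository's root.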
Results
Below are our benchmark results comparing GPTQ against picoLLM for all models. We perform 2, 3, and 4-bit quantization using GPTQ, then find the model size in GB and set that as the target size for picoLLM Compression. Hence, both models have the same size in terms of the number of bytes. When performing GPTQ, we set the group size parameter to 128, set the damp percent to 0.1 and enabled activation reordering.
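To match sizes this way, the on-disk size of each GPTQ model has to be measured first. One way to do that is to sum the file sizes in the quantized model folder, as in the sketch below. `folder_size_gb` is a hypothetical helper, and treating 1 GB as 10^9 bytes is an assumption; the sizes reported in the tables may use a different convention.

```python
import os
import tempfile


def folder_size_gb(folder):
    """Sum the sizes of all files under folder and return the total in GB (10**9 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e9


# Demonstration on a throwaway folder holding a single 2.5 MB file.
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, "weights.bin"), "wb") as f:
        f.write(b"\0" * 2_500_000)
    demo_size = folder_size_gb(demo_folder)
print(f"{demo_size:.4f}G")
```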
MMLU
The table below depicts the MMLU score of the original models.
Model | MMLU |
Gemma-2b 5.0G | 40.21 |
Gemma-7b 17.1G | 64.48 |
Llama-3-8b 16.1G | 64.88 |
Llama-2-7b 13.5G | 46.38 |
Mistral-7b-v0.1 15.0G | 62.41 |
Phi-2 5.6G | 56.04 |
The table below depicts the MMLU score of the quantized models.
Model | GPTQ | picoLLM |
Gemma-2b 3.1G | 39.07 | 41.12 |
Gemma-2b 2.9G | 27.51 | 41.12 |
Gemma-2b 2.6G | 24.93 | 41.12 |
Gemma-7b 7.2G | 62.58 | 64.98 |
Gemma-7b 6.2G | 53.30 | 64.57 |
Gemma-7b 5.2G | 25.58 | 64.32 |
Llama-2-7b 3.9G | 45.26 | 44.99 |
Llama-2-7b 3.1G | 40.40 | 40.68 |
Llama-2-7b 2.3G | 25.36 | 28.72 |
Llama-3-8b 5.7G | 63.09 | 64.96 |
Llama-3-8b 4.9G | 53.86 | 64.76 |
Llama-3-8b 4.0G | 25.05 | 61.26 |
Mistral-7b-v0.1 4.2G | 61.00 | 59.19 |
Mistral-7b-v0.1 3.3G | 23.73 | 57.72 |
Mistral-7b-v0.1 2.4G | 25.70 | 43.53 |
Phi-2 1.8G | 54.61 | 54.11 |
Phi-2 1.5G | 50.64 | 52.24 |
Phi-2 1.2G | 26.05 | 48.86 |
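One way to read the two tables together is to express each compressed score as a percentage of the original model's score. The sketch below does this for the Llama-3-8b 4.0G row, using the numbers from the tables above. `retained_percent` is a hypothetical helper, not part of this repository.

```python
def retained_percent(compressed_score, original_score):
    """Express a compressed model's score as a percentage of the original score."""
    return 100.0 * compressed_score / original_score


# Llama-3-8b: original MMLU 64.88; at 4.0G, GPTQ scores 25.05 and picoLLM 61.26.
original = 64.88
gptq_retained = retained_percent(25.05, original)
picollm_retained = retained_percent(61.26, original)
print(f"GPTQ: {gptq_retained:.1f}%  picoLLM: {picollm_retained:.1f}%")
```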
ARC Easy
The table below depicts the ARC Easy score of the original models.
Model | ARC Easy |
Gemma-2b 5.0G | 33.75 |
Gemma-7b 17.1G | 75.51 |
Llama-2-7b 13.5G | 44.87 |
Llama-3-8b 16.1G | 75.80 |
Mistral-7b-v0.1 15.0G | 80.56 |
Phi-2 5.6G | 75.25 |
The table below depicts the ARC Easy score of the quantized models.
Model | GPTQ | picoLLM |
Gemma-2b 3.1G | 30.39 | 34.39 |
Gemma-2b 2.9G | 24.37 | 34.39 |
Gemma-2b 2.6G | 23.82 | 34.39 |
Gemma-7b 7.2G | 76.52 | 84.18 |
Gemma-7b 6.2G | 44.28 | 84.51 |
Gemma-7b 5.2G | 23.95 | 84.13 |
Llama-2-7b 3.9G | 39.23 | 41.96 |
Llama-2-7b 3.1G | 32.95 | 33.96 |
Llama-2-7b 2.3G | 23.91 | 24.49 |
Llama-3-8b 5.7G | 72.85 | 78.83 |
Llama-3-8b 4.9G | 43.39 | 77.02 |
Llama-3-8b 4.0G | 24.71 | 71.76 |
Mistral-7b-v0.1 4.2G | 77.27 | 73.95 |
Mistral-7b-v0.1 3.3G | 23.91 | 72.10 |
Mistral-7b-v0.1 2.4G | 24.92 | 46.46 |
Phi-2 1.8G | 70.45 | 75.04 |
Phi-2 1.5G | 56.61 | 70.66 |
Phi-2 1.2G | 22.10 | 62.42 |
ARC Challenge
The table below depicts the ARC Challenge score of the original models.
Model | ARC Challenge |
Gemma-2b 5.0G | 30.38 |
Gemma-7b 17.1G | 64.93 |
Llama-2-7b 13.5G | 37.03 |
Llama-3-8b 16.1G | 63.05 |
Mistral-7b-v0.1 15.0G | 67.49 |
Phi-2 5.6G | 61.60 |
The table below depicts the ARC Challenge score of the quantized models.
Model | GPTQ | picoLLM |
Gemma-2b 3.1G | 26.37 | 30.97 |
Gemma-2b 2.9G | 23.55 | 30.97 |
Gemma-2b 2.6G | 24.83 | 30.97 |
Gemma-7b 7.2G | 66.30 | 72.35 |
Gemma-7b 6.2G | 33.62 | 72.35 |
Gemma-7b 5.2G | 24.06 | 72.61 |
Llama-2-7b 3.9G | 32.42 | 34.30 |
Llama-2-7b 3.1G | 27.56 | 28.24 |
Llama-2-7b 2.3G | 21.16 | 23.63 |
Llama-3-8b 5.7G | 60.24 | 64.33 |
Llama-3-8b 4.9G | 36.18 | 63.48 |
Llama-3-8b 4.0G | 23.29 | 57.85 |
Mistral-7b-v0.1 4.2G | 64.42 | 60.49 |
Mistral-7b-v0.1 3.3G | 24.06 | 59.04 |
Mistral-7b-v0.1 2.4G | 23.21 | 37.80 |
Phi-2 1.8G | 57.42 | 62.46 |
Phi-2 1.5G | 44.97 | 57.51 |
Phi-2 1.2G | 24.49 | 47.87 |
Perplexity
The table below depicts the perplexity of the original models.
Model | Perplexity |
Gemma-2b 5.0G | 16.79 |
Gemma-7b 17.1G | 14.67 |
Llama-2-7b 13.5G | 8.40 |
Llama-3-8b 16.1G | 11.61 |
Mistral-7b-v0.1 15.0G | 10.50 |
Phi-2 5.6G | 17.38 |
The table below depicts the perplexity of the quantized models.
Model | GPTQ | picoLLM |
Gemma-2b 3.1G | 17.85 | 16.86 |
Gemma-2b 2.9G | 24.11 | 16.86 |
Gemma-2b 2.6G | 8377.74 | 16.86 |
Gemma-7b 7.2G | 15.47 | 14.82 |
Gemma-7b 6.2G | 27.29 | 14.84 |
Gemma-7b 5.2G | 33370970.40 | 15.08 |
Llama-2-7b 3.9G | 8.59 | 8.50 |
Llama-2-7b 3.1G | 9.66 | 8.86 |
Llama-2-7b 2.3G | 67.43 | 10.87 |
Llama-3-8b 5.7G | 12.31 | 11.73 |
Llama-3-8b 4.9G | 17.47 | 11.90 |
Llama-3-8b 4.0G | 712.70 | 12.67 |
Mistral-7b-v0.1 4.2G | 10.43 | 10.62 |
Mistral-7b-v0.1 3.3G | 2909.83 | 10.81 |
Mistral-7b-v0.1 2.4G | 1176.43 | 14.87 |
Phi-2 1.8G | 18.15 | 17.76 |
Phi-2 1.5G | 19.94 | 18.14 |
Phi-2 1.2G | 76.55 | 20.22 |