LLM-Compressive: Longitudinal Evaluation of LLMs via Data Compression
Compression is believed to be the key feature of intelligence. LLM-Compressive evaluates Large Language Models (LLMs) for generalization and robustness via data compression: it tests LLMs on data compression along a timeline, to understand how LLMs generalize over time.
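The idea in one sketch: a model that predicts data well compresses it well, since arithmetic coding turns next-token probabilities into code length. Below is a minimal illustration (not the repo's actual pipeline, which lives in main.py) of measuring bits per byte with a causal LM; `gpt2` is an arbitrary small stand-in for the models tested here.

```python
# Minimal sketch: an LLM's cross-entropy on a text is, via arithmetic
# coding, the number of bits needed to compress it.
# Lower bits/byte = better compression.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Compression is believed to be the key feature of intelligence."
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # loss is the mean negative log-likelihood (in nats) over the
    # n_tokens - 1 next-token predictions
    loss = model(**enc, labels=enc["input_ids"]).loss.item()

n_predictions = enc["input_ids"].shape[1] - 1
total_bits = loss * n_predictions / math.log(2)  # nats -> bits
bits_per_byte = total_bits / len(text.encode("utf-8"))
print(f"{bits_per_byte:.3f} bits/byte")
```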

For example, llm-compressive tests open-source LLMs on Wikipedia data across 83 months, from 2017 to 2023.
Mistral and Baichuan2 show steady performance across all time periods, indicating promising generalization over time. In contrast, other models show linearly worsening curves.
More results on code, arXiv, news, image, and audio data are in the paper: Evaluating Large Language Models for Generalization and Robustness via Data Compression.
Updates:
- 27 Feb 2024: try the interactive leaderboard at LLM-Compressive.
Getting Started
- Clone and install requirements.
```
git clone https://github.com/liyucheng09/llm-compressive.git
cd llm-compressive
pip install -r requirements.txt
```
- Run the main test script.
```
python main.py <model_name> <dataset_name> <save_path> <context_size> <batch_size>
```
- `model_name`: the name of the model on the HF Hub. See the supported models below.
- `dataset_name`: the name of the dataset. Choose from `wikitext`, `math`, `bbc_news`, `code`, `arxiv`, `audio`, and `bbc_image`.
- `save_path`: the path to save the results.
- `context_size`: the context size used for compression. Choose from `2048`, `4096`, `8192`, `max_length`, or `stride`.
- `batch_size`: the batch size. This depends on the model scale and your GPU memory.
Attention: if you need to use a Hugging Face mirror (i.e., you have trouble accessing huggingface.co directly), add `HF_ENDPOINT=https://hf-mirror.com` to your environment variables.
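For example, a hypothetical run on Wikipedia with a 4096-token context (the model and batch size here are arbitrary illustrations, not recommendations):

```
python main.py mistralai/Mistral-7B-v0.1 wikitext results 4096 1
```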
- Aggregate the results.
```
python results/aggregate_all_results.py <save_path>
```
- `save_path`: the path you saved the results in.
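Continuing the hypothetical run above, where results were saved under `results`:

```
python results/aggregate_all_results.py results
```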
- Visualize the results.
```
python visualise/timeline_vis.py
```
This will generate a figure visualizing the trend of models' compression rate over time.
```
python visualise/big_table.py
```
This will 1) generate the big table in the paper, and 2) generate a figure showing the performance-robustness trade-off of models (like the figure below).

See the explanation of the figure in the paper.
Models
We have tested the following models:
- codellama/CodeLlama-7b-hf
- baichuan-inc/Baichuan2-7B-Base
- mistralai/Mistral-7B-v0.1
- huggyllama/llama-7b
- huggyllama/llama-13b
- huggyllama/llama-65b
- meta-llama/Llama-2-7b-hf
- meta-llama/Llama-2-13b-hf
- meta-llama/Llama-2-70b-hf
- Qwen/Qwen-7B
- internlm/internlm-7b
- THUDM/chatglm3-6b-base
- 01-ai/Yi-6B-200K
- 01-ai/Yi-34B-200K
- google/gemma-7b
- Qwen/Qwen1.5-7B
And any GPTQ version of the above models, such as:
- TheBloke/CodeLlama-70B-hf-GPTQ
- TheBloke/Llama-2-70B-GPTQ
- TheBloke/Yi-34B-200K-GPTQ
- ...
Issues
Send me an email or open an issue if you have any questions.
Citation
If you find this repo helpful, please consider citing our paper:
```
@article{Li2024EvaluatingLL,
  title={Evaluating Large Language Models for Generalization and Robustness via Data Compression},
  author={Yucheng Li and Yunhao Guo and Frank Guerin and Chenghua Lin},
  journal={arXiv preprint arXiv:2402.00861},
  year={2024}
}
```