
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation (ACL 2025 Oral)

1Southeast University, 2King’s College London, 3The Alan Turing Institute

If you find our project helpful, please give us a star ⭐ on GitHub to stay updated.


Overview

SCOPE is a simple yet effective framework designed to tackle the KV cache bottleneck in large language models (LLMs) during long-context generation. While existing methods primarily focus on the prefill phase, SCOPE introduces stage-level KV cache compression, addressing both prefill and decoding phases separately—an essential improvement for long-output reasoning tasks.

SCOPE is especially useful for LLM applications that require efficient, scalable generation with long outputs.
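For intuition, the stage separation can be sketched in a few lines of Python. This is illustrative only; the function names, tensor shapes, and budgets below are hypothetical and are not SCOPE's actual API:

# Toy illustration of stage-level KV cache budgeting (hypothetical names;
# not SCOPE's actual implementation). The point: the prefill cache and the
# decoding cache get separate, independently sized budgets.
import torch

def compress_prefill(keys, values, attn_scores, budget):
    """Keep the `budget` prefill positions that received the most attention."""
    k = min(budget, attn_scores.numel())
    keep = torch.topk(attn_scores, k).indices.sort().values
    return keys[keep], values[keep]

def evict_decoding(keys, values, window):
    """Keep only the most recent `window` decoded positions."""
    return keys[-window:], values[-window:]

# Random tensors standing in for one attention head's cache.
seq_len, head_dim = 4096, 128
prefill_k, prefill_v = compress_prefill(
    torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim),
    torch.rand(seq_len), budget=1024)                                     # prefill-phase budget
dec_k, dec_v = evict_decoding(
    torch.randn(900, head_dim), torch.randn(900, head_dim), window=256)   # decoding-phase budget
print(prefill_k.shape, dec_k.shape)  # torch.Size([1024, 128]) torch.Size([256, 128])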


Comparison of Three Paradigms

Overview of Three Decoding Strategies

Key Observations

  • Excessive compression during the prefill phase, which requires the specific full context, impairs comprehension of the reasoning task.

  • Deviation of heavy hitters occurs in reasoning tasks with long outputs.


Excessive compression

Deviation of heavy hitters

We provide a notebook, vis_topk_index_attn.ipynb, to reproduce the deviation-of-heavy-hitters result (1× A100 80GB GPU).
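For a rough sense of what is measured there, the deviation can be viewed as the (lack of) overlap between the top-k attention positions at different decoding steps. The sketch below is a minimal illustration of that idea, using random tensors as stand-ins for real attention rows; it is not the notebook's code:

# Simplified illustration of "deviation of heavy hitters": compare the top-k
# attention positions at two decoding steps. Low overlap means the heavy
# hitters identified early (e.g., during prefill) drift as generation proceeds.
# Illustrative only; vis_topk_index_attn.ipynb is the reference analysis.
import torch

def heavy_hitters(attn_row, k=64):
    """Indices of the k positions receiving the most attention in one step."""
    return set(torch.topk(attn_row, min(k, attn_row.numel())).indices.tolist())

def overlap_ratio(step_a, step_b, k=64):
    """Jaccard overlap of the heavy-hitter sets of two decoding steps."""
    a, b = heavy_hitters(step_a, k), heavy_hitters(step_b, k)
    return len(a & b) / max(len(a | b), 1)

# Random rows standing in for attention at an early vs. a late decoding step.
early, late = torch.rand(4096), torch.rand(4096)
print(f"heavy-hitter overlap: {overlap_ratio(early, late):.2f}")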

Visualization

Attention heatmaps for layer 13 of a simplified GSM8K+ sample in LongGenBench:


We provide a notebook, vis_attn_map.ipynb, to reproduce the visualization result (1× A100 80GB GPU). Attention maps for the different layers will be stored in ./attention_map.
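If you want a standalone starting point instead of the notebook, a minimal sketch along the following lines can dump per-layer attention maps with Hugging Face transformers. The model name, prompt, and head-averaging are placeholder choices, not the notebook's exact settings:

# Minimal sketch for dumping per-layer attention maps (placeholder model and
# prompt; vis_attn_map.ipynb is the reference implementation).
import os
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto",
    attn_implementation="eager",  # "eager" is needed to return attention weights
)

inputs = tokenizer("Janet has 3 apples and buys 5 more ...", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

os.makedirs("attention_map", exist_ok=True)
for layer_idx, attn in enumerate(out.attentions):    # one (batch, heads, q, k) tensor per layer
    avg = attn[0].mean(dim=0).float().cpu().numpy()  # average over heads
    plt.imshow(avg, cmap="viridis")
    plt.title(f"layer {layer_idx}")
    plt.savefig(f"attention_map/layer_{layer_idx}.png")
    plt.close()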

Requirements

torch==2.4.0
transformers==4.44.2
flash_attn==2.5.8

Environment Setup

conda create -n SCOPE
conda activate SCOPE
pip install -r requirements.txt

LongGenBench

Dataset Construction

Our dataset construction method is based on the original LongGenBench repository. We provide scripts for building the LongGenBench dataset as follows:

  • LongGenBench-4K

    Dataset   Script
    GSM8K+    create_gsm8k_30.sh
    MMLU+     create_mmlu_30.sh
    CSQA+     create_csqa_40.sh

  • LongGenBench-8K

    Dataset   Script
    GSM8K++   create_gsm8k_60.sh
    MMLU++    create_mmlu_60.sh
    CSQA++    create_csqa_80.sh
  • Example Usage

    To generate the GSM8K+ dataset, run:

    bash scripts/scripts_longgenbench/create_gsm8k_30.sh
    

Inference in LongGenBench

export CUDA_VISIBLE_DEVICES=$1

method=$2 # Supported: ALLKV, PyramidKV, PyramidInfer, SnapKV, H2O, StreamingLLM
max_capacity_prompts=$3
attn_implementation=$4 # Supported: "flash_attention_2", "sdpa", "eager"
source_path=$5
model_path=$6
decoding_metric=$7 # None or h2o for H2O; slide, adaptive, or discontinuous for SCOPE
decoding_window_size=$8
save_dir=$9 # directory where results are saved
K=${10} # 30 or 60
T=${11}
decoding_recent_size=${12} # recent-window size for SCOPE decoding (assumed here to be the 12th argument)

python3 run_longgenbench.py \
    --method ${method} \
    --model_path ${model_path} \
    --max_capacity_prompts ${max_capacity_prompts} \
    --attn_implementation ${attn_implementation} \
    --save_dir ${save_dir} \
    --use_cache True \
    --K ${K} \
    --decoding_window_size ${decoding_window_size} \
    --decoding_recent_size ${decoding_recent_size} \
    --decoding_metric ${decoding_metric} \
    --max_num_examples ${T}
Eval Acc

results_dir=$1

python3 eval_gen.py \
    --results_dir ${results_dir}

Performance in LongGenBench (Llama3.1-8B-Instruct)


The run scripts (bash files) for these experiments are located in the scripts/scripts_longgenbench folder, and the experimental results can be found in results_longgenbench_4K and results_longgenbench_8K.

Performance on the GSM8K+ task from LongGenBench-4K (Llama3.1-8B-Instruct)

Plug-in experiment results for Llama3.1-8B-Instruct on the GSM8K+ task from LongGenBench-4K.


The run scripts (bash files) for these experiments are located in the scripts/scripts_longgenbench folder, and the experimental results can be found in results_longgenbench_gsm8k_plug_in.

TODO

  • [X] fix offset bug
  • [X] improve README (expand documentation, add examples, and ensure clarity)
  • [X] reorganize the code for a better user experience

Citation

If you find our work valuable, we would appreciate your citation: 🎈

@article{wu2024scope,
  title={SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation},
  author={Wu, Jialong and Wang, Zhenglin and Zhang, Linhai and Lai, Yilong and He, Yulan and Zhou, Deyu},
  journal={arXiv preprint arXiv:2412.13649},
  year={2024}
}

Acknowledgements

  • Thanks to SnapKV and PyramidKV (KVCache-Factory) for providing open-source code that supported the expansion of this project. 🎁

  • Special thanks to LOOK-M for the beautifully designed README template, which we referenced. 🎨

  • Shoutout to @Lueci4er on GitHub for valuable suggestions on code details, which we adopted. 🛠️

The code is still being organized. 🚧