ZipCache
[NeurIPS 2024] The official implementation of ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
This repository provides the implementation of our paper. ZipCache is an adaptive mixed-precision quantization method for the KV cache of LLMs: tokens identified as salient are kept at higher precision than the rest.
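To make the idea of mixed-precision KV cache quantization concrete, here is a toy sketch in plain Python. It is not the implementation in this repository: the saliency scores are simply passed in, and the uniform min-max quantizer, the 4-bit/2-bit split, and the `salient_ratio` parameter are all illustrative assumptions.

```python
def quantize_dequantize(values, bits):
    """Uniform min-max quantization to `bits` bits, mapped back to floats."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant vector: nothing to quantize
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def mixed_precision_quantize(tokens, saliency, salient_ratio=0.5,
                             high_bits=4, low_bits=2):
    """Quantize each token vector; the most salient tokens get more bits.

    `tokens` is a list of per-token vectors (lists of floats) and
    `saliency` holds one score per token. All defaults are hypothetical.
    """
    num_salient = max(1, round(len(tokens) * salient_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)
    salient = set(ranked[:num_salient])
    bits_per_token = [high_bits if i in salient else low_bits
                      for i in range(len(tokens))]
    quantized = [quantize_dequantize(t, b)
                 for t, b in zip(tokens, bits_per_token)]
    return quantized, bits_per_token


if __name__ == "__main__":
    toy_cache = [[0.0, 0.1, 0.5, 1.0], [0.0, 0.3, 0.6, 0.9],
                 [-1.0, 0.2, 0.4, 1.0], [0.5, 0.5, 0.5, 0.5]]
    toy_saliency = [0.9, 0.1, 0.5, 0.2]
    _, bits = mixed_precision_quantize(toy_cache, toy_saliency)
    print(bits)
```

The point of the sketch is only the bit allocation: the quantizer is the same for every token, and saliency decides how many levels each token gets.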
Getting Started
Follow the step-by-step tutorial to set up ZipCache.
Step 1: Setup
Create a virtual environment and install the dependencies listed in requirements.txt. Then install flash-attn and zipcache as follows:
pip install packaging ninja
pip install flash-attn --no-build-isolation
pip install -e .
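After the install commands finish, a quick import check can confirm that the environment is usable before running the demo. The sketch below uses only the standard library. The module names are assumptions: `flash_attn` is the import name of the flash-attn package, while `zipcache` is assumed here to be the import name of the package installed from this repository.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names; adjust if the installed packages differ.
    required = ["torch", "flash_attn", "zipcache"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

`find_spec` only locates a module without importing it, so the check runs quickly and does not need a GPU.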
Step 2: Download Pretrained Models
Download a pretrained LLaMA model from Hugging Face and set MODEL_PATH in zipcache_generation_demo.py to its location.
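Before pointing MODEL_PATH at a local download, it can help to check that the directory looks like a complete Hugging Face checkpoint. The following is a minimal sketch that assumes the demo loads the model from a local directory. The helper `check_model_dir` and the example path are hypothetical and not part of this repository.

```python
from pathlib import Path


def check_model_dir(model_path):
    """Return a list of problems; an empty list means the directory looks usable."""
    root = Path(model_path)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    # Hugging Face checkpoints ship a config.json next to the weights.
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    # Weights are stored as *.safetensors or *.bin files.
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("no *.safetensors or *.bin weight files found")
    return problems


if __name__ == "__main__":
    # Hypothetical location; replace with the directory you downloaded to.
    for problem in check_model_dir("./models/llama"):
        print("Problem:", problem)
```

An empty result does not guarantee that the model loads, but it catches the common cases of a wrong path or an interrupted download.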
Step 3: Inference with ZipCache
python3 zipcache_generation_demo.py
BibTeX
If you find this work useful for your research, please consider citing:
@article{he2024zipcache,
title={ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification},
author={He, Yefei and Zhang, Luoming and Wu, Weijia and Liu, Jing and Zhou, Hong and Zhuang, Bohan},
journal={arXiv preprint arXiv:2405.14256},
year={2024}
}