
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

7 KVQuant issues

Excuse me, when executing cache-llama-activations.py in the deployment directory to generate activations.pickle, an assert(False) error is raised in the QuantK class's parallel_pack function in deployment/transformers/src/transformers/models/llama/modeling_llama.py, with self.include_sparse being...

Hi @chooper1, we need to use calibration datasets for quantization in the experiments, but sequences of this length are too long to run even on an 80GB GPU for...
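One generic way to keep calibration memory bounded is to split a long sequence into fixed-length windows and accumulate statistics per window. The sketch below only illustrates that idea; `iter_calibration_windows`, `tokens`, and `model` are hypothetical names and this is not code from the KVQuant repository or an answer from its maintainers.

```python
# Minimal sketch (not from the KVQuant codebase): split one very long
# calibration sequence into fixed-length windows so each forward pass
# fits on a single 80GB GPU; statistics can be accumulated per window.
import torch

def iter_calibration_windows(input_ids: torch.Tensor, window: int = 2048):
    """Yield (batch, window)-shaped slices of a long token sequence."""
    for start in range(0, input_ids.shape[1], window):
        yield input_ids[:, start:start + window]

# usage sketch:
# for chunk in iter_calibration_windows(tokens, window=2048):
#     with torch.no_grad():
#         model(chunk)  # collect activation statistics for this window
```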

Thank you for your excellent work! Currently, I am trying to reproduce KVQuant but have encountered some errors. Your assistance with this matter would be appreciated. ### 1. Reproduce the...

Thanks for your great work and the open-sourced code! I have some questions about the storage of the sparse matrix. Could you please provide the code to reproduce Table 10 in...

Thank you for your great work! Now I want to reproduce the perplexity of LLaMA-7B on Wikitext-2 with the "ATOM-4bit" method, but I cannot find the code in...

When I run `CUDA_VISIBLE_DEVICES=0 python llama_simquant.py --abits 4 --nsamples 16 --seqlen 2048 --nuq --fisher --quantize --include_sparse --sparsity-threshold 0.99 --quantizer_path quantizers.pickle`, I get this error: AttributeError: 'LlamaModel' object has no attribute...

Thanks for the great work! I am curious about the time complexity of the pre-RoPE quantization. Specifically, I assume the operations proceed in the following order with pre-RoPE quantization...
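For reference, a minimal sketch of the operation order the question assumes with pre-RoPE quantization: keys are quantized before the rotary embedding, and RoPE is applied on the fly after dequantization at attention time. The helpers `quantize_key` and `dequantize_key` are hypothetical placeholders, and this is not the repository's actual implementation.

```python
# Sketch of the assumed pre-RoPE quantization order (hypothetical helpers,
# not the functions used in this repository).
import torch

def rotate_half(x):
    # standard LLaMA-style rotation used by RoPE
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def cache_key_pre_rope(k, quantize_key):
    # pre-RoPE: the raw key projection is quantized and stored
    # before any rotary embedding is applied
    return quantize_key(k)

def attend_with_cached_keys(k_cached, cos, sin, dequantize_key):
    # at attention time the cached keys are dequantized and RoPE is
    # applied on the fly, so the rotation cost is paid per decoding step
    k = dequantize_key(k_cached)
    return (k * cos) + (rotate_half(k) * sin)
```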