
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

7 KVQuant issues

Excuse me, when executing cache-llama-activations.py in the deployment directory to generate activations.pickle, an assert(False) error is raised in the QuantK class's parallel_pack function in deployment/transformers/src/transformers/models/llama/modeling_llama.py, with self.include_sparse being...

Hi @chooper1, we need to use calibration datasets for quantization in the experiments, but sequences of this length are too long to run even on an 80GB GPU for...
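One generic way to keep calibration memory bounded is to split a long sequence into fixed-length windows and accumulate statistics per window. The sketch below only illustrates that idea; `iter_calibration_windows`, `tokens`, and `model` are hypothetical names and this is not code from the KVQuant repository or an answer from its maintainers.

```python
# Minimal sketch (not from the KVQuant codebase): split one very long
# calibration sequence into fixed-length windows so each forward pass
# fits on a single 80GB GPU; statistics can be accumulated per window.
import torch

def iter_calibration_windows(input_ids: torch.Tensor, window: int = 2048):
    """Yield (batch, window)-shaped slices of a long token sequence."""
    for start in range(0, input_ids.shape[1], window):
        yield input_ids[:, start:start + window]

# usage sketch:
# for chunk in iter_calibration_windows(tokens, window=2048):
#     with torch.no_grad():
#         model(chunk)  # collect activation statistics for this window
```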

Thank you for your excellent work! Currently, I am trying to reproduce KVQuant but have encountered some errors. Your assistance with this matter would be appreciated. ### 1. Reproduce the...

Thanks for your great work and the open-sourced code! I have some questions about the storage of the sparse matrix. Could you please provide the code to reproduce Table 10 in...

Thank you for your great work! Now I want to reproduce the perplexity of LLaMA-7B on Wikitext-2 with the "ATOM-4bit" method, but I cannot find the code in...

When I run `CUDA_VISIBLE_DEVICES=0 python llama_simquant.py --abits 4 --nsamples 16 --seqlen 2048 --nuq --fisher --quantize --include_sparse --sparsity-threshold 0.99 --quantizer_path quantizers.pickle`, I get this error: AttributeError: 'LlamaModel' object has no attribute...

Thanks for the great work! I am curious about the time complexity of the pre-RoPE quantization. Specifically, I assume the operations proceed in the following order with pre-RoPE quantization...
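For reference, a minimal sketch of the operation order the question assumes with pre-RoPE quantization: keys are quantized before the rotary embedding, and RoPE is applied on the fly after dequantization at attention time. The helpers `quantize_key` and `dequantize_key` are hypothetical placeholders, and this is not the repository's actual implementation.

```python
# Sketch of the assumed pre-RoPE quantization order (hypothetical helpers,
# not the functions used in this repository).
import torch

def rotate_half(x):
    # standard LLaMA-style rotation used by RoPE
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def cache_key_pre_rope(k, quantize_key):
    # pre-RoPE: the raw key projection is quantized and stored
    # before any rotary embedding is applied
    return quantize_key(k)

def attend_with_cached_keys(k_cached, cos, sin, dequantize_key):
    # at attention time the cached keys are dequantized and RoPE is
    # applied on the fly, so the rotation cost is paid per decoding step
    k = dequantize_key(k_cached)
    return (k * cos) + (rotate_half(k) * sin)
```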