cold-compress
Question on Performance Comparison using Different Cache Bit Precision
Testing the impact of KV cache quantization on the performance of a Llama-2 model shows that decode tokens/sec drops as the cache bit width is reduced, even though cache memory shrinks as expected.
Command (run once per `cache_bits` value):
python generate.py --cache_strategy full --prompt "What is a cold compress?" --checkpoint_path ./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --device cuda:0 --cache_bits 4/8/16
Bits: 4
- Decode tokens per sec: 13.57
- Cache memory used: 0.07 GB
Bits: 8
- Decode tokens per sec: 17.56
- Cache memory used: 0.13 GB
Bits: 16
- Decode tokens per sec: 26.09
- Cache memory used: 0.26 GB
Is this reduction in throughput expected? Is it caused by the extra quantize/dequantize operations?
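For context on why quantize/dequantize could plausibly cost throughput: a quantized cache stores integers plus a scale, so every attention read has to cast back to floating point and rescale before the matmul. Below is a minimal numpy sketch of symmetric per-tensor int8 round-tripping (not cold-compress's actual kernels, and function names here are illustrative, not from the repo) showing both the memory saving and the extra work per read:

```python
import numpy as np

def quantize(x, bits=8):
    # Symmetric per-tensor quantization: map fp16 values onto the
    # signed integer grid, keeping one floating-point scale factor.
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(x).max()) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Extra work on every cache read during decode:
    # cast back to floating point and rescale.
    return (q.astype(np.float32) * scale).astype(np.float16)

# A toy "KV cache" slice: (seq_len, head_dim) in fp16.
kv = np.random.randn(128, 64).astype(np.float16)

q, scale = quantize(kv, bits=8)
kv_hat = dequantize(q, scale)

# Memory halves (int8 vs fp16)...
print(kv.nbytes / q.nbytes)  # → 2.0

# ...at the cost of a round-trip on each decode step, plus a small
# quantization error bounded by roughly half the scale.
print(float(np.abs(kv.astype(np.float32) - kv_hat.astype(np.float32)).max()) <= scale)
```

So yes, fewer cache bits trading memory for tokens/sec is the expected direction: the dequantize (and, for newly appended tokens, quantize) kernels run on every decode step, and at 7B scale on a single GPU that overhead is not amortized away. Whether the specific magnitudes above are reasonable depends on how fused cold-compress's quantization kernels are, which the maintainers would have to confirm.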