
Question on Performance Comparison using Different Cache Bit Precision

Open soumendukrg opened this issue 1 year ago • 0 comments

Testing the impact of KV cache quantization on the inference performance of the Llama 2 model shows a decrease in tokens/sec as the cache bit width is reduced, even though the expected reduction in cache memory is observed.

Command (run once per --cache_bits value): python generate.py --cache_strategy full --prompt "What is a cold compress?" --checkpoint_path ./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --device cuda:0 --cache_bits 4/8/16
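
For completeness, this is the trivial driver I used to repeat the run at each bit width; it just wraps the exact command above (no flags beyond the ones already shown):

```python
# Re-run generate.py once per cache bit width, using the same flags as above.
import subprocess

for bits in (4, 8, 16):
    subprocess.run([
        "python", "generate.py",
        "--cache_strategy", "full",
        "--prompt", "What is a cold compress?",
        "--checkpoint_path", "./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth",
        "--device", "cuda:0",
        "--cache_bits", str(bits),
    ], check=True)
```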

Bits: 4

  • Decode tokens per sec: 13.57
  • Cache memory used: 0.07 GB

Bits: 8

  • Decode tokens per sec: 17.56
  • Cache memory used: 0.13 GB

Bits: 16

  • Decode tokens per sec: 26.09
  • Cache memory used: 0.26 GB

Is this reduction in decode throughput expected? Is it caused by the extra quantize/dequantize operations performed on the cache at each step?
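
My working mental model of the overhead, as a minimal standalone sketch (this is not cold-compress's actual kernel; the per-token int8 scheme, tensor shapes, and timing harness are all illustrative assumptions), is that every decode step has to dequantize the entire cache back to floating point before the attention matmuls, and that extra pass grows with sequence length:

```python
# Minimal sketch: full-precision KV cache vs. a quantized cache that must be
# dequantized on every decode step. Illustrative only, not the cold-compress
# implementation.
import time

import torch

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
n_heads, head_dim, seq_len = 32, 128, 1024

# Full-precision cache: directly usable by the attention matmuls.
k_full = torch.randn(n_heads, seq_len, head_dim, dtype=dtype, device=device)
v_full = torch.randn_like(k_full)

def quantize(x):
    # Symmetric per-token int8 quantization; keeps one scale per cache entry.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / 127.0
    return (x / scale).round().to(torch.int8), scale

# Quantized cache: stored as int8 plus per-token scales, so it uses less
# memory, but it is not directly matmul-ready.
k_q, k_scale = quantize(k_full)
v_q, v_scale = quantize(v_full)

q = torch.randn(n_heads, 1, head_dim, dtype=dtype, device=device)

def attend(q, k, v):
    scores = (q @ k.transpose(-1, -2)) / head_dim**0.5
    return torch.softmax(scores, dim=-1) @ v

def decode_full():
    return attend(q, k_full, v_full)

def decode_quantized():
    # The suspected overhead: dequantize the WHOLE cache each decode step.
    k = k_q.to(dtype) * k_scale
    v = v_q.to(dtype) * v_scale
    return attend(q, k, v)

def bench(fn, iters=100):
    fn()  # warmup
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3  # ms per decode step

print(f"full-precision cache: {bench(decode_full):.3f} ms/step")
print(f"quantized cache:      {bench(decode_quantized):.3f} ms/step")
```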

soumendukrg · Oct 19 '24 09:10