cold-compress
Question on Performance Comparison using Different Cache Bit Precision
Testing the impact of KV cache quantization on the performance of a Llama-2 model shows that decode tokens/sec drops as the cache bit width is reduced, even though cache memory shrinks as expected.
Command (run once per `cache_bits` value):
python generate.py --cache_strategy full --prompt "What is a cold compress?" --checkpoint_path ./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --device cuda:0 --cache_bits 4/8/16
Bits: 4
- Decode tokens per sec: 13.57
- Cache memory used: 0.07 GB
Bits: 8
- Decode tokens per sec: 17.56
- Cache memory used: 0.13 GB
Bits: 16
- Decode tokens per sec: 26.09
- Cache memory used: 0.26 GB
Is this reduction in throughput expected? Is it caused by the extra quantize/dequantize operations?
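For context on why quantize/dequantize could plausibly cost throughput: a quantized cache stores integers plus a scale, so every attention read has to cast back to floating point and rescale before the matmul. Below is a minimal numpy sketch of symmetric per-tensor int8 round-tripping (not cold-compress's actual kernels, and function names here are illustrative, not from the repo) showing both the memory saving and the extra work per read:

```python
import numpy as np

def quantize(x, bits=8):
    # Symmetric per-tensor quantization: map fp16 values onto the
    # signed integer grid, keeping one floating-point scale factor.
    qmax = 2 ** (bits - 1) - 1
    scale = float(np.abs(x).max()) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Extra work on every cache read during decode:
    # cast back to floating point and rescale.
    return (q.astype(np.float32) * scale).astype(np.float16)

# A toy "KV cache" slice: (seq_len, head_dim) in fp16.
kv = np.random.randn(128, 64).astype(np.float16)

q, scale = quantize(kv, bits=8)
kv_hat = dequantize(q, scale)

# Memory halves (int8 vs fp16)...
print(kv.nbytes / q.nbytes)  # → 2.0

# ...at the cost of a round-trip on each decode step, plus a small
# quantization error bounded by roughly half the scale.
print(float(np.abs(kv.astype(np.float32) - kv_hat.astype(np.float32)).max()) <= scale)
```

So yes, fewer cache bits trading memory for tokens/sec is the expected direction: the dequantize (and, for newly appended tokens, quantize) kernels run on every decode step, and at 7B scale on a single GPU that overhead is not amortized away. Whether the specific magnitudes above are reasonable depends on how fused cold-compress's quantization kernels are, which the maintainers would have to confirm.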