
[Feature]: Exllamav2 Q4, Q6, and Q8 cache

Open Anthonyg5005 opened this issue 1 year ago • 3 comments

🚀 The feature, motivation and pitch

I only found a discussion asking about this, but according to the exllamav2 evaluation, the Q4 cache now appears to be better than the FP8 cache and close to, or almost equal to, the FP16 cache. I don't use this engine myself and am only looking in from the outside, but I believe this could benefit users who are trying to squeeze in a bit more context without reducing overall accuracy by much.
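To give a rough sense of the "more context" part, here is a back-of-the-envelope sketch of KV cache size at different element widths. The model shape (32 layers, 8 KV heads, head dimension 128) and the 32k context are made-up illustration values, and `kv_cache_bytes` is a hypothetical helper, not part of exllamav2 or aphrodite-engine. It also uses nominal bit widths only, so it ignores the scale metadata that quantized caches store alongside the values; real numbers will be somewhat higher.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_element):
    """Approximate KV cache size in bytes (nominal bits only, no quantization overhead)."""
    # Factor of 2: one tensor for keys and one for values, per layer.
    total_elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return total_elements * bits_per_element / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for name, bits in [("FP16", 16), ("FP8/Q8", 8), ("Q6", 6), ("Q4", 4)]:
    gib = kv_cache_bytes(seq_len=32768, bits_per_element=bits, **shape) / 2**30
    print(f"{name}: {gib:.2f} GiB for 32768 tokens")
```

Under these assumptions, a Q4 cache takes about a quarter of the memory of an FP16 cache, so the same memory budget holds roughly four times as many tokens.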

Additional context

Here's the evaluation comparing the different cache types: turboderp/exllamav2/doc/qcache_eval.md

Anthonyg5005 • May 09 '24 18:05

It's definitely a planned feature. I believe @sgsdxzy wanted to work on it.

AlpinDale • May 09 '24 18:05

Alright, feel free to close this issue when that's done.

Anthonyg5005 • May 09 '24 19:05

Also, an update on this: the FP8 cache may be removed from exllamav2 sometime in the future, and the Q8 and Q6 caches are now in the master branch.

Anthonyg5005 • Jun 08 '24 20:06