Eric Buehler
@sammcj @dinerburger sorry for the late reply! I've begun work in #988.

> Forgive my ignorance here - when you say int4/int8 - are you talking about quantising down to...
@sammcj @dinerburger @mahmoodsh36 KV quant is finally implemented in #1400!
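For anyone curious how int8 KV quantization works in principle, here is a minimal, self-contained Rust sketch of quantizing and restoring a cache row. It is illustrative only: `quantize_i8` and `dequantize_i8` are made-up names for this example, not the actual implementation from #1400, which quantizes at a finer granularity.

```rust
// Illustrative per-tensor int8 quantization of a KV cache row.
// Not the mistral.rs #1400 implementation; just the general idea.
fn quantize_i8(values: &[f32]) -> (Vec<i8>, f32) {
    // Scale so the largest magnitude maps to 127.
    let max_abs = values.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

fn dequantize_i8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&x| x as f32 * scale).collect()
}

fn main() {
    // A toy "key" row; real caches quantize per head/group for accuracy.
    let k = vec![0.12_f32, -0.98, 0.45, 2.3, -1.7];
    let (qk, scale) = quantize_i8(&k);
    let restored = dequantize_i8(&qk, scale);
    println!("scale = {scale:.4}");
    println!("restored = {restored:?}"); // close to the original values
}
```

The payoff is memory: each cached element shrinks from 2-4 bytes (f16/f32) to 1 byte plus a shared scale, at the cost of a small dequantization error.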
| Platform | TG (256) T/s | PP (22) T/s |
| -- | -- | -- |
| mistral.rs | 37.39 | 49.78 |
| llama.cpp | 38.94 | ...
| Platform | TG (256) T/s | PP (22) T/s |
| -- | -- | -- |
| mistral.rs | 37.39 | 274.32 |
| llama.cpp | 38.94 | ...
Thank you! More benchmarks below for some smaller models:

## Llama 3.2 3b

| Platform | TG (256) T/s | PP (22) T/s |
| -- | -- | -- |
...
New benchmarks! #916 introduced a preallocated KV cache for better decoding efficiency. We can already see good results for some of the smaller models:

Llama 3.2 3b

| Platform | ...
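To illustrate the idea behind a preallocated cache, here is a hedged Rust sketch: reserve the full buffer once, then copy each decode step's entries into place instead of concatenating (and reallocating) on every step. `PreallocatedCache` and its methods are illustrative names, not the actual #916 code.

```rust
// Sketch of a preallocated KV cache: one upfront allocation, O(head_dim)
// copies per decode step, no per-token reallocation. Illustrative only.
struct PreallocatedCache {
    buf: Vec<f32>,   // flattened [max_seq_len * head_dim]
    head_dim: usize,
    len: usize,      // tokens currently stored
}

impl PreallocatedCache {
    fn new(max_seq_len: usize, head_dim: usize) -> Self {
        Self { buf: vec![0.0; max_seq_len * head_dim], head_dim, len: 0 }
    }

    /// Append one token's vector by copying into the reserved region.
    fn append(&mut self, token_kv: &[f32]) {
        assert_eq!(token_kv.len(), self.head_dim);
        let start = self.len * self.head_dim;
        self.buf[start..start + self.head_dim].copy_from_slice(token_kv);
        self.len += 1;
    }

    /// View of the filled region, i.e. what attention actually reads.
    fn valid(&self) -> &[f32] {
        &self.buf[..self.len * self.head_dim]
    }
}

fn main() {
    let mut cache = PreallocatedCache::new(4096, 8);
    cache.append(&[0.1; 8]);
    cache.append(&[0.2; 8]);
    println!("cached tokens: {}", cache.len);        // 2
    println!("valid elements: {}", cache.valid().len()); // 16
}
```

The win over naive concatenation is that the per-step cost stays constant rather than growing with sequence length, which matters most during long decodes.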
New benchmarks with improved performance for long-context generation! #933 was just merged with some optimizations based on ml-explore/mlx#1597! Current benchmarks with Llama 3.1 8b show us benefiting from the change...
New benchmarks! #1094 adds a small optimization of the sampling process for improved decoding efficiency.

Llama 3.2 3b

| Platform | TG (256) T/s | PP (256) T/s |
| -- | -- | -- |
...
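As a rough illustration of the kind of sampling-path optimization involved, here is a hedged Rust sketch that restricts the softmax to the top-k logits using a partial selection instead of a full sort of the vocabulary. This is a generic technique, not necessarily what #1094 changed; `sample_top_k` is a made-up name.

```rust
// Top-k sampling with a partial selection: O(V) average to find the top k,
// then softmax over only k entries. Illustrative, not the #1094 change.
fn sample_top_k(logits: &[f32], k: usize, rng_uniform: f32) -> usize {
    // select_nth_unstable_by partitions in O(V) average time, avoiding the
    // O(V log V) cost of fully sorting the vocabulary.
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.select_nth_unstable_by(k - 1, |&a, &b| {
        logits[b].partial_cmp(&logits[a]).unwrap()
    });
    let top = &idx[..k];

    // Softmax over only k entries (subtract the max for numerical stability).
    let max = top.iter().map(|&i| logits[i]).fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = top.iter().map(|&i| (logits[i] - max).exp()).collect();
    let total: f32 = weights.iter().sum();

    // Inverse-CDF sampling with a uniform draw in [0, 1).
    let mut acc = 0.0;
    for (w, &i) in weights.iter().zip(top.iter()) {
        acc += w / total;
        if rng_uniform < acc {
            return i;
        }
    }
    top[k - 1]
}

fn main() {
    let logits = vec![0.1, 3.2, -1.0, 2.8, 0.5];
    let token = sample_top_k(&logits, 2, 0.4);
    println!("sampled token id: {token}");
}
```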
Here are some benchmarks of various models in preparation for our v0.5.0 release!

## Llama 3.2 3b, 8bit

| Platform | TG (256) T/s | PP (256) T/s |
| -- | -- | -- |
...
@terhechte thanks for the issue. #1250 should fix this; can you please retry after a `git pull`?