crabml
Only f32 is supported for now. Make a quick and naive implementation first, just to get generation working.
- [x] #186
- [x] #187
- [x] #188
- [ ] #216
https://huggingface.co/apple/OpenELM-270M-Instruct
Mistral's context window is longer than the KV cache, thanks to sliding window attention. We can make the KV cache a ring buffer, so that we can keep chat...
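A minimal sketch of the ring-buffer idea: once the cache holds `window` entries, each new token overwrites the oldest slot, so memory stays bounded while generation continues past the window. The type and method names here are illustrative, not crabml's actual API.

```rust
/// Hypothetical ring-buffer KV cache for sliding window attention.
/// Only keys are shown; values would be handled identically.
struct RingKvCache {
    window: usize,       // sliding window size (cache capacity)
    len: usize,          // total tokens seen so far
    keys: Vec<Vec<f32>>, // one key vector per cached slot
}

impl RingKvCache {
    fn new(window: usize) -> Self {
        Self { window, len: 0, keys: vec![Vec::new(); window] }
    }

    /// Append a key; once the window is full, the oldest slot is overwritten.
    fn push(&mut self, key: Vec<f32>) {
        let slot = self.len % self.window;
        self.keys[slot] = key;
        self.len += 1;
    }

    /// Number of positions currently attendable.
    fn cached(&self) -> usize {
        self.len.min(self.window)
    }
}

fn main() {
    let mut cache = RingKvCache::new(4);
    for t in 0..6 {
        cache.push(vec![t as f32]);
    }
    // Tokens 0 and 1 were overwritten; only the last 4 keys survive.
    assert_eq!(cache.cached(), 4);
    assert_eq!(cache.keys[0], vec![4.0]); // slot 0 now holds token 4
    println!("cached = {}", cache.cached());
}
```

With this layout, attention only ever sees the most recent `window` positions, which matches what sliding window attention needs, and the allocation never grows with sequence length.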
As described in https://arxiv.org/pdf/2309.16609.pdf, the architectural differences from LLaMA are:  Reference: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/models/qwen.py
Are there any speed benchmarks?