llama.cpp
llama: use sliding window for phi3
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [x] Low
- [ ] Medium
- [ ] High
Related issue report: #7709
This PR switches Phi3 model to use sliding window attention. After this PR, it no longer geneartes broken output after the 2,048 token. Tested on "phi3-mini-4k-instruct" model.
TODO: (DONE) ~~convert_hf_to_gguf.py changes~~