[CPU] PagedAttention: support Dynamic SplitFuse
Details:
- Merge first token and second token inference into one parallel loop
- ~~Additional optimization: pre-transpose k-cache, pre-pack v-cache if needed~~
- Additional optimization for first token: under the causal mask, skip the upper-triangle part of the q * k' computation, and restrict the (q * k') * v computation to the lower-triangle part
- C++ pipeline can enable it: https://github.com/ilya-lavrenov/openvino.genai/pull/4
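For reviewers, the first-token triangle optimization listed above can be sketched roughly as follows. This is a minimal single-head illustration, not the kernel in this PR: the `Matrix` type and the `causal_attention_prefill` function are hypothetical names, and it assumes a plain causal mask with `1/sqrt(head_size)` scaling. For query row `i`, only keys `0..i` are scored, and only those weights are multiplied with `v`, so the masked upper triangle is never computed.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: dense row-major float matrix.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Causal attention over a prompt of L tokens (assumed single head).
// q and k are L x S, v is L x D. Row i attends to keys 0..i only.
Matrix causal_attention_prefill(const Matrix& q, const Matrix& k, const Matrix& v) {
    assert(q.rows == k.rows && k.rows == v.rows && q.cols == k.cols);
    const std::size_t L = q.rows, S = q.cols, D = v.cols;
    const float scale = 1.0f / std::sqrt(static_cast<float>(S));
    Matrix out(L, D);
    std::vector<float> w(L);  // scratch buffer for one row of scores

    for (std::size_t i = 0; i < L; ++i) {
        // q * k': compute only columns j <= i (lower triangle).
        float max_score = -std::numeric_limits<float>::infinity();
        for (std::size_t j = 0; j <= i; ++j) {
            float dot = 0.0f;
            for (std::size_t d = 0; d < S; ++d) dot += q.at(i, d) * k.at(j, d);
            w[j] = dot * scale;
            max_score = std::max(max_score, w[j]);
        }
        // Softmax over the visible keys only.
        float sum = 0.0f;
        for (std::size_t j = 0; j <= i; ++j) {
            w[j] = std::exp(w[j] - max_score);
            sum += w[j];
        }
        // (q * k') * v: accumulate only the nonzero (lower-triangle) weights.
        for (std::size_t j = 0; j <= i; ++j) {
            const float weight = w[j] / sum;
            for (std::size_t d = 0; d < D; ++d) out.at(i, d) += weight * v.at(j, d);
        }
    }
    return out;
}
```

Compared with computing the full L x L score matrix and masking afterwards, this does roughly half the multiply-adds in both matrix products for a prompt-only batch.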
TODO (in another PR):
- alibi support
- performance tuning
- test cases