openvino


openvinotoolkit

[CPU] PagedAttention supports dynamic-split fuse

Open luo-cheng2021 opened this issue 10 months ago • 0 comments

Details:

- Merge first-token (prompt) and second-token (generation) inference into one parallel loop
- ~~Additional optimization: pre-transpose k-cache, pre-pack v-cache if needed~~
- Additional optimization for the first token: skip computing the upper triangle of q * k', and compute (q * k') * v using only the lower triangle of q * k'
- The C++ pipeline can enable it: https://github.com/ilya-lavrenov/openvino.genai/pull/4
TODO (in another PR):
- ALiBi support
- performance tuning
- test cases
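
To illustrate the two main ideas above (one loop shared by prompt and generation tokens, and skipping the masked upper triangle), here is a minimal pure-Python sketch. It is not the OpenVINO CPU kernel: the names `attend`, `build_work_items`, `run_step`, and `masked_reference`, the dict-based sequence layout, and the sample data are all hypothetical and chosen for illustration only. It assumes a plain causal mask, a single head, and a contiguous (non-paged) cache.

```python
import math

def attend(q, keys, values, kv_len):
    """Softmax attention of one query row over the first kv_len cached rows only."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in range(kv_len)]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(weights[j] * values[j][d] for j in range(kv_len)) / total
            for d in range(len(values[0]))]

def build_work_items(sequences):
    """Flatten prompt and generation tokens into one list of (seq_id, q_row, kv_len)."""
    items = []
    for seq_id, seq in enumerate(sequences):
        past = len(seq["k"]) - len(seq["q"])  # rows already cached before this step
        for i in range(len(seq["q"])):
            # Query row i may only see keys 0 .. past + i (causal mask), so the
            # masked upper-triangle entries are never computed at all.
            items.append((seq_id, i, past + i + 1))
    return items

def run_step(sequences):
    """Single loop over all work items; a real kernel would run it in parallel."""
    outputs = [[None] * len(seq["q"]) for seq in sequences]
    for seq_id, i, kv_len in build_work_items(sequences):
        seq = sequences[seq_id]
        outputs[seq_id][i] = attend(seq["q"][i], seq["k"], seq["v"], kv_len)
    return outputs

def masked_reference(seq):
    """Dense reference: compute every score, then mask future positions with -inf."""
    past = len(seq["k"]) - len(seq["q"])
    scale = 1.0 / math.sqrt(len(seq["q"][0]))
    out = []
    for i, q in enumerate(seq["q"]):
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in seq["k"]]
        scores = [s if j <= past + i else -math.inf for j, s in enumerate(scores)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]  # masked entries become 0.0
        total = sum(w)
        out.append([sum(w[j] * seq["v"][j][d] for j in range(len(w))) / total
                    for d in range(len(seq["v"][0]))])
    return out

# Hypothetical batch: one 3-token prompt and one sequence generating its 4th token.
prompt = {"q": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
          "k": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
          "v": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]}
decode = {"q": [[0.5, -0.5]],
          "k": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],
          "v": [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]}
sequences = [prompt, decode]
results = run_step(sequences)
```

Because every work item carries its own `kv_len`, prompt rows and generation rows go through the same code path, and the per-row loop bounds replace an explicit mask. The dense `masked_reference` is included only to check that skipping the masked entries does not change the result.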

Tickets:

138673

Apr 18 '24 08:04 luo-cheng2021