
[CPU] PagedAttention supports dynamic-split fuse

Open luo-cheng2021 opened this issue 10 months ago • 0 comments

Details:

  • Merge first-token and second-token inference into one parallel loop
  • ~~Additional optimization: pre-transpose k-cache, pre-pack v-cache if needed~~
  • Additional optimization for the first token: skip the upper-triangle part of the q * k' computation, and in (q * k') * v compute only with the lower-triangle part
  • C++ pipeline can enable it: https://github.com/ilya-lavrenov/openvino.genai/pull/4
  • TODO(in another PR):
    • alibi support
    • performance tuning
    • test cases
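
As a rough illustration of the first and third bullets, here is a minimal sketch, not the OpenVINO CPU kernel. It assumes a single attention head, a contiguous (non-paged) KV cache, and a serial loop standing in for the parallel one. `Sequence`, `WorkItem`, `attend_row` and `mixed_batch_attention` are hypothetical names introduced only for this example. Prompt (first-token) and decode (second-token) sequences are flattened into one list of query rows, and each row visits only the keys it is allowed to see, so the masked upper triangle of q * k' is never computed and the multiplication with v touches only the lower-triangle weights.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-sequence description (not an OpenVINO type).
struct Sequence {
    std::size_t past_len;    // tokens already in the KV cache (0 for a fresh prompt)
    std::size_t q_len;       // new query tokens this step (1 for a decode step)
    std::vector<float> q;    // [q_len, d]
    std::vector<float> k;    // [past_len + q_len, d]
    std::vector<float> v;    // [past_len + q_len, d]
    std::vector<float> out;  // [q_len, d], filled by mixed_batch_attention
};

// One unit of work: a single query row of a single sequence.
struct WorkItem {
    std::size_t seq;
    std::size_t pos;
};

// Softmax attention for one query row over the first kv_len keys only.
void attend_row(const float* q, const float* k, const float* v,
                std::size_t kv_len, std::size_t d, float* out) {
    std::vector<float> w(kv_len);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    float max_w = -std::numeric_limits<float>::infinity();
    for (std::size_t j = 0; j < kv_len; ++j) {
        float dot = 0.0f;
        for (std::size_t c = 0; c < d; ++c) dot += q[c] * k[j * d + c];
        w[j] = dot * scale;
        max_w = std::max(max_w, w[j]);
    }
    float sum = 0.0f;
    for (std::size_t j = 0; j < kv_len; ++j) {
        w[j] = std::exp(w[j] - max_w);  // subtract the max for numerical stability
        sum += w[j];
    }
    std::fill(out, out + d, 0.0f);
    for (std::size_t j = 0; j < kv_len; ++j)
        for (std::size_t c = 0; c < d; ++c) out[c] += (w[j] / sum) * v[j * d + c];
}

void mixed_batch_attention(std::vector<Sequence>& batch, std::size_t d) {
    // Flatten prompt rows and decode rows into a single work list.
    std::vector<WorkItem> items;
    for (std::size_t s = 0; s < batch.size(); ++s) {
        Sequence& seq = batch[s];
        assert(seq.q.size() == seq.q_len * d);
        assert(seq.k.size() == (seq.past_len + seq.q_len) * d);
        assert(seq.v.size() == seq.k.size());
        seq.out.assign(seq.q_len * d, 0.0f);
        for (std::size_t p = 0; p < seq.q_len; ++p) items.push_back({s, p});
    }
    // Single loop over all rows. Each iteration writes a disjoint output row,
    // so this loop is the one a parallel-for would split across threads.
    for (const WorkItem& it : items) {
        Sequence& seq = batch[it.seq];
        // Causal limit: row `pos` sees the cached tokens plus new tokens 0..pos.
        const std::size_t kv_len = seq.past_len + it.pos + 1;
        attend_row(seq.q.data() + it.pos * d, seq.k.data(), seq.v.data(),
                   kv_len, d, seq.out.data() + it.pos * d);
    }
}
```

With this layout a decode step is just a sequence with `q_len == 1` and a non-zero `past_len`, so it needs no separate code path. Per-row work differs between prompt and decode rows, which is one reason a real kernel would likely balance work at a finer granularity than one row per item.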

Tickets:

luo-cheng2021 • Apr 18 '24 08:04