
[FEAT REQ][CUDA GRAPH] Allow explicit control flag to force enable/disable split KV

Open AgrawalAmey opened this issue 7 months ago • 2 comments

Hello @yzh119,

Currently, we use two independent API calls for prefill and decode in a mixed-batch setting. This makes defining a CUDA graph layout considerably harder. Ideally, we could do both the prefill and the decode attention computation in the prefill kernel, which would considerably simplify the CUDA graph layout. The main barrier right now is that we have no explicit control over when split-KV is used. For mixed batches, split-KV appears to be beneficial in most cases, but it seems to get disabled for certain batch compositions, which significantly hurts latency. Would it be possible to add an optional override knob for this? Thanks!
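
To make the request concrete, here is a minimal sketch of the kind of tri-state override being asked for. It is not FlashInfer's API: the function `choose_split_kv` and the parameter `force_split_kv` are hypothetical names chosen for illustration, and the library's internal heuristic is represented only by a boolean passed in by the caller.

```python
from typing import Optional


def choose_split_kv(heuristic_wants_split: bool,
                    force_split_kv: Optional[bool] = None) -> bool:
    """Decide whether to use split-KV (hypothetical helper, not a FlashInfer call).

    heuristic_wants_split: whatever the library's own heuristic would pick
        for the current batch composition.
    force_split_kv: None  -> defer to the heuristic (today's behavior),
                    True  -> always use split-KV,
                    False -> never use split-KV.
    """
    if force_split_kv is None:
        return heuristic_wants_split
    return force_split_kv


# With the override set, the decision no longer depends on batch composition.
print(choose_split_kv(heuristic_wants_split=False, force_split_kv=True))
print(choose_split_kv(heuristic_wants_split=True))
```

Defaulting the flag to `None` would keep existing behavior unchanged, while an explicit `True` or `False` would pin the decision regardless of how the batch is composed, which is the property needed for a fixed CUDA graph layout.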

AgrawalAmey, Jul 26 '24 00:07