flash-attention
In paged attention mode, must the K cache space be allocated as one contiguous region?
I'm confused by the gmem allocation for paged attention.
In paged attention mode, for two K cache blocks belonging to the same query, does the gmem need to be allocated contiguously?
For example, with nheads=1, headdim=128, page_block_size=64: say the address of the query's first cache block is __half* p_block0,
and the address of the second block is __half* p_block1.
Does p_block1 = p_block0 + page_block_size * headdim * nheads always hold?
Not sure I understand the question, but the function docstring should tell you the shapes of the tensors and whether they need to be contiguous:
k_cache: (batch_size_cache, seqlen_cache, nheads_k, headdim) if there's no block_table,
or (num_blocks, page_block_size, nheads_k, headdim) if there's a block_table (i.e. paged KV cache)
page_block_size must be a multiple of 256.
v_cache: (batch_size_cache, seqlen_cache, nheads_k, headdim) if there's no block_table,
or (num_blocks, page_block_size, nheads_k, headdim) if there's a block_table (i.e. paged KV cache)
block_table [optional]: (batch_size, max_num_blocks_per_seq), dtype torch.int32.
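If the question is whether two logically consecutive blocks of the same sequence must be adjacent in memory: with a block_table they need not be. The whole pool of num_blocks blocks is one tensor, each block is contiguous internally, but block_table maps a sequence's logical block index to an arbitrary physical block index in that pool. A minimal NumPy sketch (illustrative values only, not flash-attention code) of how the gather works:

```python
import numpy as np

# Shapes follow the docstring above; the concrete numbers are made up.
num_blocks, page_block_size, nheads_k, headdim = 8, 256, 1, 128
block_elems = page_block_size * nheads_k * headdim  # elements per physical block

# One big pool holding all physical blocks. Each block is contiguous
# internally, but logically consecutive blocks of a sequence can live
# anywhere in this pool.
k_cache = np.arange(num_blocks * block_elems, dtype=np.float32).reshape(
    num_blocks, page_block_size, nheads_k, headdim)

# block_table[b, i] = physical block holding logical block i of sequence b.
# Here sequence 0's two logical blocks are physical blocks 5 and 2,
# which are not adjacent in memory (block 1 even precedes block 0).
block_table = np.array([[5, 2]], dtype=np.int32)

# The kernel reconstructs the logical K sequence by gathering blocks
# through the table, not by pointer arithmetic on a contiguous region.
logical_k = k_cache[block_table[0]].reshape(-1, nheads_k, headdim)

# Element offsets of the two blocks from the start of the pool:
p_block0 = int(block_table[0, 0]) * block_elems
p_block1 = int(block_table[0, 1]) * block_elems
# p_block1 != p_block0 + block_elems unless the physical indices
# happen to be consecutive, which nothing requires.
```

So the only contiguity requirement is the one the docstring states per tensor: each block of shape (page_block_size, nheads_k, headdim) is contiguous inside k_cache, but the blocks a given query reads can be scattered anywhere among the num_blocks slots.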