flash-attention
In paged attention mode, must the K cache space be allocated as one contiguous region?
I'm confused by the gmem allocation for paged attention.
In paged attention mode, for two K cache blocks belonging to the same query, does the gmem need to be allocated contiguously?
For example, with nheads=1, headdim=128, page_block_size=64: say the address of the query's first cache block is __half* p_block0,
and the address of the second block is __half* p_block1.
Does p_block1 = p_block0 + page_block_size * headdim * nheads always hold?
Not sure I understand the question, but the function docstring should tell you the shapes of the tensors and whether they need to be contiguous:
k_cache: (batch_size_cache, seqlen_cache, nheads_k, headdim) if there's no block_table,
or (num_blocks, page_block_size, nheads_k, headdim) if there's a block_table (i.e. paged KV cache)
page_block_size must be a multiple of 256.
v_cache: (batch_size_cache, seqlen_cache, nheads_k, headdim) if there's no block_table,
or (num_blocks, page_block_size, nheads_k, headdim) if there's a block_table (i.e. paged KV cache)
block_table [optional]: (batch_size, max_num_blocks_per_seq), dtype torch.int32.
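If the question is whether two logically consecutive blocks of the same sequence must be adjacent in memory: with a block_table they need not be. The whole pool of num_blocks blocks is one tensor, each block is contiguous internally, but block_table maps a sequence's logical block index to an arbitrary physical block index in that pool. A minimal NumPy sketch (illustrative values only, not flash-attention code) of how the gather works:

```python
import numpy as np

# Shapes follow the docstring above; the concrete numbers are made up.
num_blocks, page_block_size, nheads_k, headdim = 8, 256, 1, 128
block_elems = page_block_size * nheads_k * headdim  # elements per physical block

# One big pool holding all physical blocks. Each block is contiguous
# internally, but logically consecutive blocks of a sequence can live
# anywhere in this pool.
k_cache = np.arange(num_blocks * block_elems, dtype=np.float32).reshape(
    num_blocks, page_block_size, nheads_k, headdim)

# block_table[b, i] = physical block holding logical block i of sequence b.
# Here sequence 0's two logical blocks are physical blocks 5 and 2,
# which are not adjacent in memory (block 1 even precedes block 0).
block_table = np.array([[5, 2]], dtype=np.int32)

# The kernel reconstructs the logical K sequence by gathering blocks
# through the table, not by pointer arithmetic on a contiguous region.
logical_k = k_cache[block_table[0]].reshape(-1, nheads_k, headdim)

# Element offsets of the two blocks from the start of the pool:
p_block0 = int(block_table[0, 0]) * block_elems
p_block1 = int(block_table[0, 1]) * block_elems
# p_block1 != p_block0 + block_elems unless the physical indices
# happen to be consecutive, which nothing requires.
```

So the only contiguity requirement is the one the docstring states per tensor: each block of shape (page_block_size, nheads_k, headdim) is contiguous inside k_cache, but the blocks a given query reads can be scattered anywhere among the num_blocks slots.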