SimpleTheoryOfTypes comments

Results 5 comments of SimpleTheoryOfTypes

where is flash decoding second stage (reduce) code ?

What’s the best way to trigger the flash-decoding path when using `flash_fwd_splitkv_kernel(...)`? Is it correct to set num_splits = 0 and let the heuristics decide automatically? For flash-decoding, is the...

where is flash decoding second stage (reduce) code ?

Is it feasible to vectorize the `S=QK^T` and `SV` GEMMs along the batch dimension in flash decoding? For example, during decoding, the query q has a shape of [b, 1,...
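To make the shapes in this question concrete, here is a minimal sketch of the `S=QK^T` step for one decoding step. It assumes a single head and that every sequence in the batch has its own K cache. The names `decode_scores`, `q`, and `k_cache` and the dimension layout are hypothetical, chosen for illustration only, and are not taken from the flash-attention code. Under those assumptions, each batch element contributes one independent vector-matrix product, because the K matrix differs from one batch element to the next.

```python
def decode_scores(q, k_cache):
    """Per-batch attention scores for a single decoding step.

    q:       [batch][head_dim]           one query row per sequence (hypothetical layout)
    k_cache: [batch][seq_len][head_dim]  a separate K cache per sequence
    returns: [batch][seq_len]
    """
    return [
        [sum(qd * kd for qd, kd in zip(q_b, k_row)) for k_row in k_b]
        for q_b, k_b in zip(q, k_cache)
    ]


# Two sequences, head_dim = 2, seq_len = 2.
scores = decode_scores(
    [[1, 0], [0, 2]],
    [[[1, 2], [3, 4]], [[1, 2], [3, 4]]],
)
```

The outer loop over the batch is the dimension the question asks about; the sketch only illustrates why it is a batch of small products rather than one large GEMM when the caches are not shared.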

where is flash decoding second stage (reduce) code ?

Given that, with small batch sizes, the attention kernel during decoding is memory-bound, why would maximizing SM utilization by creating more parallel work along the sequence dimension still lead to...

where is flash decoding second stage (reduce) code ?

Thanks a lot for the explanation! That makes sense: flash decoding also makes better use of memory bandwidth by issuing more LD/ST instructions in parallel.
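For readers following this thread, the two stages discussed above can be sketched in plain Python. This is a minimal illustration of the general split-KV idea, not the kernel code from the repository: stage 1 computes attention independently over each KV chunk and records its log-sum-exp, and stage 2 (the reduce) rescales and merges the partial outputs. The function names `attend_chunk`, `combine`, and `split_kv_decode` are hypothetical, and the `num_splits` argument here is just a chunk count for this sketch.

```python
import math


def attend_chunk(q, keys, values):
    """Stage 1: softmax attention of one query over one KV chunk.

    Returns the chunk-local output and the chunk's log-sum-exp (lse).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)


def combine(partials):
    """Stage 2 (reduce): merge chunk outputs, weighting each by exp(lse)."""
    lse_max = max(lse for _, lse in partials)
    scales = [math.exp(lse - lse_max) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    return [
        sum(s * out[d] for s, (out, _) in zip(scales, partials)) / total
        for d in range(dim)
    ]


def split_kv_decode(q, keys, values, num_splits):
    """Split the KV sequence into chunks, attend per chunk, then reduce."""
    chunk = math.ceil(len(keys) / num_splits)
    partials = [
        attend_chunk(q, keys[i:i + chunk], values[i:i + chunk])
        for i in range(0, len(keys), chunk)
    ]
    return combine(partials)
```

In this sketch the chunks are processed in a Python loop; on a GPU each chunk would be independent work, which is where the extra parallelism along the sequence dimension comes from. The result should match unsplit attention up to floating-point rounding.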

[QST] flash_attn2: why tOrVt is no swizzle ?

On FA 2.6.3, using sVt instead of sVtNoSwizzle generates correct results for my token decoding app (I'm using cutlass@19f515 - maybe this was a cutlass bug back then?). btw, even...