Tri Dao
The model parameters are still defined in PyTorch, so it's just `sum(p.numel() for p in model.parameters())`. For FLOPs you can calculate by hand, or search the issues on this repo.
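As a rough sketch of the by-hand estimate (my own illustration, using the common convention that an `(m, k) x (k, n)` matmul costs `2*m*k*n` FLOPs; the attention-only formula below is an assumption, not something taken from this repo):

```python
import torch

def count_params(model: torch.nn.Module) -> int:
    # Total parameter count, as in the reply above.
    return sum(p.numel() for p in model.parameters())

def attn_fwd_flops(batch, nheads, seqlen, headdim):
    # Two matmuls per head (Q @ K^T and P @ V), each 2 * seqlen^2 * headdim FLOPs.
    return 4 * batch * nheads * seqlen * seqlen * headdim

print(attn_fwd_flops(batch=8, nheads=16, seqlen=2048, headdim=64))
```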
Right, LDSM won't work for V if the data is 8-bit. We might have some way to address this soon.
You can use LDSM.T and byte-permute, then LDSM, as a way to transpose V. We'll release that code soon; I don't know if it works well without warp specialization.
Sorry, I mean LDSM.T, byte-permute, then store using STSM. That way you can transpose V.
I see, I forgot that STSM is Hopper-only. The other option is to transpose V in a separate kernel, or fuse it with a preceding kernel (e.g. a GEMM).
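To make the byte-level idea concrete, here is a minimal NumPy sketch (my own illustration, not the kernel code): a transpose done at 16-bit granularity, which is what an LDSM.T-style load provides, followed by a byte-deinterleave standing in for the byte-permute, recovers the true transpose of 8-bit data.

```python
import numpy as np

x = np.arange(64, dtype=np.uint8).reshape(8, 8)   # 8-bit matrix to transpose

# Step 1: transpose at 16-bit granularity: view adjacent byte pairs as
# uint16 elements and transpose that matrix.
t16 = x.view(np.uint16).T.copy()                  # (8, 4) uint16 -> (4, 8)

# Step 2: byte-permute-style fixup: each 16-bit-transposed row holds two
# byte-rows of the true transpose interleaved; deinterleave them.
pairs = t16.view(np.uint8).reshape(4, 8, 2)       # [row-pair, col, byte-in-pair]
xt = pairs.transpose(0, 2, 1).reshape(8, 8)

assert np.array_equal(xt, x.T)                    # true 8-bit transpose
```

In the register-level version, the store (STSM on Hopper) would then write the permuted fragments back to shared memory; the separate or fused transpose kernel mentioned above is the fallback when that instruction isn't available.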
The figure is not drawn to scale; it's just an illustration. The way we do it, softmax only has 1 MUFU (the exponential). There's no floating-point division. Division is done...
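As an illustration of what "one exponential per element, no per-element division" can look like, here is a minimal NumPy sketch of online softmax (my own sketch, not the kernel: it assumes normalization is deferred to a single reciprocal-multiply per row at the end).

```python
import numpy as np

def online_softmax_rows(s, block=4):
    """Row-wise softmax computed in column blocks, with one exponential per
    element of s and no per-element division: each row is normalized once at
    the end by multiplying with the reciprocal of its running sum."""
    m = np.full(s.shape[0], -np.inf)          # running row max
    l = np.zeros(s.shape[0])                  # running sum of exp(s - m)
    out = np.zeros_like(s, dtype=np.float64)
    for j0 in range(0, s.shape[1], block):
        blk = s[:, j0:j0 + block]
        m_new = np.maximum(m, blk.max(axis=1))
        scale = np.exp(m - m_new)             # rescale earlier accumulators
        p = np.exp(blk - m_new[:, None])      # the one exponential per element
        l = l * scale + p.sum(axis=1)
        out[:, :j0] *= scale[:, None]
        out[:, j0:j0 + block] = p
        m = m_new
    return out * (1.0 / l)[:, None]           # one reciprocal per row

x = np.random.randn(2, 16)
ref = np.exp(x - x.max(axis=1, keepdims=True))
ref /= ref.sum(axis=1, keepdims=True)
assert np.allclose(online_softmax_rows(x), ref)
```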
For that kind of profiling you'd need to record the global clock, store it to global memory, then visualize it later. It's quite manual. Triton has a profiler (Proton) that does...
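For the "visualize it later" step, a minimal sketch in Python, assuming the kernel has already written one `(sm_id, start, end)` global-clock record per thread block to a buffer and it has been copied to the host (the record layout and values below are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dump: one (sm_id, start_clock, end_clock) record per thread block.
records = np.array([
    [0, 100, 450],
    [0, 460, 800],
    [1, 120, 500],
    [2, 130, 470],
])

fig, ax = plt.subplots()
for sm_id, start, end in records:
    # One horizontal bar per thread block, placed on its SM's row.
    ax.broken_barh([(start, end - start)], (sm_id - 0.4, 0.8))
ax.set_xlabel("global clock (ticks)")
ax.set_ylabel("SM id")
ax.set_yticks(np.unique(records[:, 0]))
plt.savefig("block_timeline.png")
```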
Yes. The original FlashAttention implementation (May 2022) didn't have any seqlen parallelism. Later on (in the v1 code) we added a kind of parallelism in the forward pass where we decide...
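To illustrate why seqlen parallelism matters (a back-of-the-envelope sketch of my own, not the repo's actual launch heuristic): with parallelism only over (batch, head), the number of thread blocks can be too small to fill the GPU when `batch * nheads` is small.

```python
import math

def num_thread_blocks(batch, nheads, seqlen, block_m, seqlen_parallel):
    # Thread blocks launched: one per (batch, head), times the number of
    # seqlen blocks if we also parallelize over the sequence dimension.
    blocks = batch * nheads
    if seqlen_parallel:
        blocks *= math.ceil(seqlen / block_m)
    return blocks

# e.g. batch=1, 16 heads, seqlen 8192, 128-row blocks, GPU with ~100+ SMs:
print(num_thread_blocks(1, 16, 8192, 128, seqlen_parallel=False))  # 16 blocks
print(num_thread_blocks(1, 16, 8192, 128, seqlen_parallel=True))   # 1024 blocks
```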