Tri Dao
Thanks so much @Hprairie! Let me take a careful look
@Hprairie is it simpler to set `ddA_cs_ptrs` to `hi` after the loop? Then we don't need to worry about which block M & block N would work?
Oh I was gonna do the simple thing. Instead of

```
for start_n in range(hi, chunk_size, BLOCK_SIZE_N):
    tl.store(ddAcs_ptrs + stride_ddA_cs_csize_n, tl.zeros((BLOCK_SIZE_N,), dtype=tl.float32), mask=offs_n < chunk_size - start_n - 1)
    ddAcs_ptrs...
```
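If it helps, here's a standalone toy in the same masked-store-zeros pattern; the kernel, tensor layout, and launch below are purely illustrative, not the actual ddA_cs kernel:

```
import torch
import triton
import triton.language as tl

@triton.jit
def zero_tail_kernel(x_ptr, hi, chunk_size, BLOCK_SIZE_N: tl.constexpr):
    # Each program zeroes the tail x[row, hi:] of one row, stepping in BLOCK_SIZE_N chunks
    # and masking the final partial block.
    row = tl.program_id(0)
    offs_n = tl.arange(0, BLOCK_SIZE_N)
    ptrs = x_ptr + row * chunk_size + hi + offs_n
    for start_n in range(hi, chunk_size, BLOCK_SIZE_N):
        tl.store(ptrs, tl.zeros((BLOCK_SIZE_N,), dtype=tl.float32),
                 mask=offs_n < chunk_size - start_n)
        ptrs += BLOCK_SIZE_N

x = torch.randn(4, 256, device="cuda")
hi = 64
zero_tail_kernel[(x.shape[0],)](x, hi, x.shape[1], BLOCK_SIZE_N=64)
assert (x[:, hi:] == 0).all()
```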
arm64 wheels are already out with v4.1
Please don't use `time.time()`: https://pytorch.org/tutorials/recipes/recipes/benchmark.html
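For example, a minimal sketch with `torch.utils.benchmark` (the workload here is just a placeholder):

```
import torch
from torch.utils import benchmark

# Placeholder workload; substitute the kernel/module you actually want to time.
def fn(x):
    return torch.nn.functional.silu(x)

x = torch.randn(4096, 4096, device="cuda")

# Timer handles warmup and CUDA synchronization, unlike wrapping an async
# kernel launch with time.time().
t = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x})
print(t.blocked_autorange(min_run_time=1.0))
```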
Does it work with CUDA 12.4 and above?
> In the meantime, this also suggests that an immediate workaround is to build the partitioner with a very similar TMEM tensor and then use it on your actual TMEM...
I'm not familiar with FSDP; can you post a short script to replicate? Is the issue just activation checkpointing, or is FSDP relevant?
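Something along these lines would be enough (a hypothetical skeleton, not code from this issue; swap in the actual model):

```
# Hypothetical repro skeleton: a tiny model wrapped in FSDP, with the forward
# run under activation checkpointing. Launch with:
#   torchrun --nproc_per_node=1 repro.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024)
    ).cuda()
    model = FSDP(model)

    x = torch.randn(8, 1024, device="cuda", requires_grad=True)
    # Drop `checkpoint` here to test whether activation checkpointing is needed
    # to trigger the problem, and drop the FSDP wrap to test whether FSDP is relevant.
    y = checkpoint(model, x, use_reentrant=False)
    y.sum().backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```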
I don't know anything about the L20.
Looks like an Ampere card, not Hopper. So no.
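For reference, a quick way to double-check from PyTorch (Ampere reports compute capability 8.x, Hopper reports 9.0):

```
import torch

# Print the GPU name and its compute capability (Ampere is 8.x, Hopper is 9.0).
major, minor = torch.cuda.get_device_capability()
print(torch.cuda.get_device_name(), f"sm_{major}{minor}")
```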