Tri Dao

429 comments of Tri Dao

Thanks so much @Hprairie! Let me take a careful look

@Hprairie is it simpler to set `ddA_cs_ptrs` to `hi` after the loop? Then we don't need to worry about which block M & block N would work?

Oh I was gonna do the simple thing. Instead of

```
for start_n in range(hi, chunk_size, BLOCK_SIZE_N):
    tl.store(ddAcs_ptrs + stride_ddA_cs_csize_n, tl.zeros((BLOCK_SIZE_N,), dtype=tl.float32), mask=offs_n < chunk_size - start_n - 1)
    ddAcs_ptrs...
```
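As a minimal sketch of the tail-zeroing pattern being discussed (not the actual kernel): a standalone Triton kernel that zeros a float32 row from `hi` to `chunk_size`, advancing the pointers one block per iteration. The kernel name, launch setup, and the exact mask bound are assumptions; the real kernel's strides and its `- 1` offset may differ.

```python
# Hypothetical, simplified Triton kernel illustrating the pattern above;
# names mirror the snippet but the surrounding setup is assumed.
import torch
import triton
import triton.language as tl

@triton.jit
def _zero_tail_kernel(ddAcs_ptr, hi, chunk_size, stride_n, BLOCK_SIZE_N: tl.constexpr):
    offs_n = tl.arange(0, BLOCK_SIZE_N)
    # Point each lane at its element in the tail region starting at `hi`.
    ddAcs_ptrs = ddAcs_ptr + (hi + offs_n) * stride_n
    for start_n in range(hi, chunk_size, BLOCK_SIZE_N):
        # Zero one block, masking lanes that run past the end of the chunk.
        tl.store(ddAcs_ptrs,
                 tl.zeros((BLOCK_SIZE_N,), dtype=tl.float32),
                 mask=offs_n < chunk_size - start_n)
        # Advance the pointers to the next block.
        ddAcs_ptrs += BLOCK_SIZE_N * stride_n

# Example launch on a single row.
x = torch.ones(256, device="cuda", dtype=torch.float32)
_zero_tail_kernel[(1,)](x, 100, 256, x.stride(0), BLOCK_SIZE_N=64)
```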

arm64 wheels are already out with v4.1

Please don't use `time.time()`; see https://pytorch.org/tutorials/recipes/recipes/benchmark.html
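For reference, a minimal sketch of the approach that tutorial recommends, using `torch.utils.benchmark.Timer`, which handles warmup and CUDA synchronization; the matmul is just a placeholder workload.

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.randn(4096, 4096, device="cuda")
timer = benchmark.Timer(
    stmt="x @ x",         # placeholder workload; substitute the op under test
    globals={"x": x},
)
print(timer.timeit(100))  # synchronizes CUDA around the timed region
```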

> In the meantime, this also suggests that an immediate workaround is to build the partitioner with a very similar TMEM tensor and then use it on your actual TMEM...

I'm not familiar with FSDP; can you post a short script to replicate? Is the issue just activation checkpointing, or is FSDP relevant?
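As a hedged sketch of what such a repro script might look like (a toy model under FSDP with activation checkpointing; the model, sizes, and script name are placeholders, not the reporter's actual setup):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

# Run with: torchrun --nproc_per_node=2 repro.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy placeholder model; swap in the module that actually misbehaves.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.Linear(1024, 1024),
).cuda()
model = FSDP(model)

x = torch.randn(8, 1024, device="cuda")
out = checkpoint(model, x, use_reentrant=False)  # activation checkpointing
out.sum().backward()
```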

idk anything about L20

looks like an Ampere card, not Hopper. So no.