Tri Dao
Thanks so much @Hprairie! Let me take a careful look
@Hprairie is it simpler to set `ddA_cs_ptrs` to `hi` after the loop? Then we don't need to worry about which block M & block N would work?
Oh I was gonna do the simple thing. Instead of

```
for start_n in range(hi, chunk_size, BLOCK_SIZE_N):
    tl.store(ddAcs_ptrs + stride_ddA_cs_csize_n, tl.zeros((BLOCK_SIZE_N,), dtype=tl.float32), mask=offs_n < chunk_size - start_n - 1)
    ddAcs_ptrs...
```
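If it helps, here's a standalone toy in the same masked-store-zeros pattern; the kernel, tensor layout, and launch below are purely illustrative, not the actual ddA_cs kernel:

```
import torch
import triton
import triton.language as tl

@triton.jit
def zero_tail_kernel(x_ptr, hi, chunk_size, BLOCK_SIZE_N: tl.constexpr):
    # Each program zeroes the tail x[row, hi:] of one row, stepping in BLOCK_SIZE_N chunks
    # and masking the final partial block.
    row = tl.program_id(0)
    offs_n = tl.arange(0, BLOCK_SIZE_N)
    ptrs = x_ptr + row * chunk_size + hi + offs_n
    for start_n in range(hi, chunk_size, BLOCK_SIZE_N):
        tl.store(ptrs, tl.zeros((BLOCK_SIZE_N,), dtype=tl.float32),
                 mask=offs_n < chunk_size - start_n)
        ptrs += BLOCK_SIZE_N

x = torch.randn(4, 256, device="cuda")
hi = 64
zero_tail_kernel[(x.shape[0],)](x, hi, x.shape[1], BLOCK_SIZE_N=64)
assert (x[:, hi:] == 0).all()
```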
arm64 wheels are already out with v4.1
Please don't use `time.time()`: https://pytorch.org/tutorials/recipes/recipes/benchmark.html
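For example, a minimal sketch with `torch.utils.benchmark` (the workload here is just a placeholder):

```
import torch
from torch.utils import benchmark

# Placeholder workload; substitute the kernel/module you actually want to time.
def fn(x):
    return torch.nn.functional.silu(x)

x = torch.randn(4096, 4096, device="cuda")

# Timer handles warmup and CUDA synchronization, unlike wrapping an async
# kernel launch with time.time().
t = benchmark.Timer(stmt="fn(x)", globals={"fn": fn, "x": x})
print(t.blocked_autorange(min_run_time=1.0))
```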
Does it work with CUDA 12.4 and above?
> In the meantime, this also suggests that an immediate workaround is to build the partitioner with a very similar TMEM tensor and then use it on your actual TMEM...
I'm not familiar with FSDP; can you post a short script to replicate? Is the issue just activation checkpointing, or is FSDP relevant?
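Something along these lines would be enough (a hypothetical skeleton, not code from this issue; swap in the actual model):

```
# Hypothetical repro skeleton: a tiny model wrapped in FSDP, with the forward
# run under activation checkpointing. Launch with:
#   torchrun --nproc_per_node=1 repro.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.utils.checkpoint import checkpoint

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 1024)
    ).cuda()
    model = FSDP(model)

    x = torch.randn(8, 1024, device="cuda", requires_grad=True)
    # Drop `checkpoint` here to test whether activation checkpointing is needed
    # to trigger the problem, and drop the FSDP wrap to test whether FSDP is relevant.
    y = checkpoint(model, x, use_reentrant=False)
    y.sum().backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```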
I don't know anything about the L20.
Looks like an Ampere card, not Hopper. So no.
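For reference, a quick way to double-check from PyTorch (Ampere reports compute capability 8.x, Hopper reports 9.0):

```
import torch

# Print the GPU name and its compute capability (Ampere is 8.x, Hopper is 9.0).
major, minor = torch.cuda.get_device_capability()
print(torch.cuda.get_device_name(), f"sm_{major}{minor}")
```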