Tri Dao
I've been trying to make this work, but I'm not experienced with torch.compile. If you figure something out, please let me know.
Thanks for the investigation. So right now it sounds like it's hard to do in-place ops with torch.compile.
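For context, a minimal sketch (not from this thread) of the pattern being discussed: mutating a function input in place, as a kernel updating a KV cache would, is harder for torch.compile to trace than an out-of-place rewrite. The names `kv_cache`, `new_kv`, and both functions are hypothetical, assuming a recent PyTorch.

```python
import torch

# Hypothetical sketch: in-place mutation of a function *input* is the kind
# of op torch.compile has trouble with, since the compiler must reason
# about aliasing and mutation. An out-of-place rewrite is the usual workaround.

def inplace_update(kv_cache: torch.Tensor, new_kv: torch.Tensor) -> torch.Tensor:
    kv_cache.add_(new_kv)      # mutates an input tensor in place
    return kv_cache

def functional_update(kv_cache: torch.Tensor, new_kv: torch.Tensor) -> torch.Tensor:
    return kv_cache + new_kv   # out-of-place: straightforward to trace

# Wrapping is cheap; actual compilation happens on the first call. The
# functional version traces cleanly; the in-place version may be cloned
# or cause a fallback, depending on the PyTorch version.
compiled_update = torch.compile(functional_update)
```

Whether the in-place version compiles without graph breaks depends on the PyTorch version, which is presumably what the investigation above was probing.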
Sure, can you point me to how they do it?
There's a plan, but it'll take a while.
No, we don't commit to a public timeline. It really depends on how much time folks are contributing.
We're working on Blackwell
It's coming along. Meanwhile, you can use either cuDNN or the CUTLASS implementation: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
We're building on the CuTe DSL example here: https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/fmha.py If you'd like to help, you can start by porting the backward pass from C++ to CuTe DSL: https://github.com/NVIDIA/cutlass/tree/main/examples/77_blackwell_fmha
Yes, we plan to support aarch64 (because of GB200). Currently CuTe DSL doesn't have an aarch64 wheel yet (only x86_64), but that will be fixed soon.