Tri Dao
The GitHub runner takes about 30 mins to compile: https://github.com/state-spaces/mamba/actions/runs/12206882183/job/34057291390. Make sure you have `ninja` installed to parallelize the build.
The kernels are copy-pasted, AFAIK.
Is `nvcc` installed?
Likely setup.py can't find the right path to `nvcc`. We rely on `CUDA_HOME` from `torch.utils.cpp_extension`. What does `from torch.utils.cpp_extension import CUDA_HOME` give?
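A quick way to check what PyTorch resolves for the CUDA root (guarded here so it degrades gracefully if PyTorch isn't installed in the current environment):

```python
# Inspect where PyTorch thinks CUDA lives. CUDA_HOME is None when
# torch.utils.cpp_extension cannot locate a CUDA toolkit (no nvcc on
# PATH and no CUDA_HOME/CUDA_PATH environment variable set).
try:
    from torch.utils.cpp_extension import CUDA_HOME
except ImportError:
    CUDA_HOME = None  # PyTorch not installed in this environment

print(CUDA_HOME)  # e.g. "/usr/local/cuda", or None if CUDA wasn't found
```

If this prints `None`, the extension build will fail before it even reaches the compiler.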
As the error message says: ```CUDA_HOME environment variable is not set. Please set it to your CUDA install root```
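For example, assuming the toolkit is installed under `/usr/local/cuda` (adjust the path to wherever your CUDA toolkit actually lives):

```shell
# Point CUDA_HOME at the CUDA install root before building the extension.
export CUDA_HOME=/usr/local/cuda
```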
I don't have experience on Windows. Cutlass 3.2 is supposed to work on Windows, but we may need to do more work on the FlashAttention side to enable Windows support...
Sorry I'm traveling this week but will have time to look into this next week.
The gradient should be converted to fp32 automatically if the fwd was in fp32.
I believe the bwd of attn_ref will first convert dO from bf16 to fp32 (since the last step of attn_ref converts the output from fp32 -> bf16). You should with...
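This upcast behavior can be seen with a minimal sketch (hypothetical toy computation, assuming PyTorch): when the last forward op is an fp32 -> bf16 cast, the backward of that cast delivers the incoming bf16 gradient as fp32 to the preceding op.

```python
import torch

# bf16 input, fp32 intermediate compute (like attn_ref's internal softmax/matmul)
x = torch.randn(4, dtype=torch.bfloat16, requires_grad=True)
y = x.float() * 2.0          # fp32 compute
y.retain_grad()              # keep the non-leaf gradient so we can inspect it
out = y.to(torch.bfloat16)   # final fp32 -> bf16 cast, like attn_ref's output

out.sum().backward()

# The bwd of the cast upcasts dO (bf16) to fp32 before it reaches y:
print(y.grad.dtype)  # torch.float32
# The leaf gradient is cast back to the leaf's own dtype:
print(x.grad.dtype)  # torch.bfloat16
```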