Tri Dao

250 comments by Tri Dao

I just haven't had time to review and merge it (it's a pretty big change). Still trying to figure out a good way to support both mask and bias without...

Usually that's because of some mismatch between the pytorch cuda version and the nvcc version used to compile FlashAttention. If you're certain there's no mismatch then idk what could be wrong....
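
A quick way to check for such a mismatch is a sketch like the one below; it assumes `nvcc` is on your PATH and simply prints the two versions so you can compare them.

```python
# Minimal sketch: compare the CUDA version pytorch was compiled with against
# the nvcc toolkit found on PATH (a mismatch commonly breaks the FlashAttention build/import).
import subprocess

import torch

print("pytorch built with CUDA:", torch.version.cuda)
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```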

We recommend using Nvidia's pytorch [container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch), which has the right environment already set up.

Can you try the latest version (1.0.6)?

We should have prebuilt wheels for this setting (torch 2.0 cuda 11.8) that setup.py automatically downloads, and nvcc should not be necessary. Are you installing from source or from PyPI...

I see. The current setup.py might still require nvcc; I'll figure out how to fix that later. As a workaround for now you can try `FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn --no-build-isolation`

I turned the "raise error" into a warning, but it looks like that's not enough: constructing the CUDAExtension with pytorch already requires CUDA_HOME. Let me think about it more.
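
For context on why CUDA_HOME gets pulled in, here is a minimal setup.py sketch (hypothetical names and sources, not the actual flash-attn build script) that only constructs the extension when CUDA is actually available:

```python
# Sketch: only construct the CUDA extension when CUDA_HOME is available,
# so wheel-only installs can skip the nvcc requirement entirely.
import os

from setuptools import setup
from torch.utils.cpp_extension import CUDA_HOME, BuildExtension, CUDAExtension

ext_modules = []
skip_build = os.environ.get("FLASH_ATTENTION_SKIP_CUDA_BUILD", "FALSE") == "TRUE"
if not skip_build and CUDA_HOME is not None:
    # CUDAExtension needs CUDA_HOME to locate nvcc, so it is only built here.
    ext_modules.append(
        CUDAExtension(
            name="flash_attn_cuda_example",          # hypothetical extension name
            sources=["csrc/example_flash_attn.cu"],  # placeholder source file
        )
    )

setup(
    name="flash-attn-example",  # placeholder package name
    ext_modules=ext_modules,
    cmdclass={"build_ext": BuildExtension} if ext_modules else {},
)
```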

Maybe the function you're looking for is `block_diag_butterfly_project_einsum_rank` (you can see in our tests here that the projection recovers the original factors): https://github.com/HazyResearch/fly/blob/cd624cffeffa7d1579336d26a776405bf0867f36/tests/ops/test_blockdiag_butterfly_einsum.py#L112