Tri Dao
Can you save the tensors being passed to flash_attn_cuda.varlen_bwd and send them to me? Otherwise it would be very hard to debug. And can you print out the value of...
Would be hard for me to debug if I can't reproduce it. You can wrap the call in a try/except to hopefully save the tensors.
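Something like this sketch could work for capturing the inputs when the varlen backward fails; the shapes, sequence lengths, and variable names below are made up, so swap in whatever your code actually passes:

```python
# A minimal sketch: run the varlen attention, and if the backward throws,
# save the exact inputs so the failure can be reproduced offline.
import torch
from flash_attn import flash_attn_varlen_func

# Illustrative packed-sequence inputs: (total_tokens, nheads, headdim).
total, nheads, headdim = 1024, 16, 64
q, k, v = [
    torch.randn(total, nheads, headdim, device="cuda", dtype=torch.float16,
                requires_grad=True)
    for _ in range(3)
]
cu_seqlens = torch.tensor([0, 512, 1024], device="cuda", dtype=torch.int32)
max_seqlen = 512

try:
    out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens,
                                 max_seqlen, max_seqlen, causal=True)
    out.sum().backward()
except RuntimeError:
    # Dump the tensors that triggered the error so they can be shared.
    torch.save(
        {"q": q, "k": k, "v": v,
         "cu_seqlens_q": cu_seqlens, "cu_seqlens_k": cu_seqlens,
         "max_seqlen_q": max_seqlen, "max_seqlen_k": max_seqlen},
        "varlen_bwd_repro.pt",
    )
    raise
```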
I haven't had much bandwidth to work on Turing.
I see. I'll try to find some time this weekend for this. Is the usage on T4 just inference (forward pass only)?
> Hi, has there been any update on this?

No, I haven't had much time.
Nope, I've had no bandwidth.
Please benchmark just the attention operation.
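For example, a rough sketch of timing only the attention call with `flash_attn_func` (the shapes below are illustrative; adjust batch size, sequence length, and heads to match your model):

```python
# Time just the forward attention op, excluding the rest of the model.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 4, 4096, 32, 128
q, k, v = [
    torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
    for _ in range(3)
]

# Warmup so kernel launches and caching don't skew the measurement.
for _ in range(10):
    flash_attn_func(q, k, v, causal=True)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()
start.record()
n_iters = 100
for _ in range(n_iters):
    flash_attn_func(q, k, v, causal=True)
end.record()
torch.cuda.synchronize()
print(f"fwd attention: {start.elapsed_time(end) / n_iters:.3f} ms/iter")
```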
Try flash-attn 2.5.1 on nvcr 23.12 or 24.01.
Can you try `python -m pip install flash-attn`? It's possible that `pip` and `python -m pip` refer to different environments. Getting the dependencies right for all setups is hard. We...
I don't know of a solution that works for all setups; happy to hear suggestions. We recommend the [PyTorch](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) container from Nvidia, which has all the required tools to install...
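One quick way to check whether `pip` and `python` point at the same environment (a sketch, assuming `pip` is on your PATH):

```python
# If the interpreter path below and the path printed by `pip --version`
# belong to different environments, `pip install` is putting packages
# somewhere the interpreter running your code never looks.
import subprocess
import sys

print("python executable:", sys.executable)
print("pip reports:", subprocess.run(
    ["pip", "--version"], capture_output=True, text=True
).stdout.strip())
```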