Tri Dao

Results 440 comments of Tri Dao

I'm very curious about this. I think all of the values in `dq`, `dk`, `dv` should be overwritten during the execution of the backward pass. The only problematic scenario I could...

Does the same thing happen if you use the standard implementation of attention? i.e. try `use_flash_attn=False`
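For context, "standard implementation" here means plain materialized attention, softmax(QK^T/sqrt(d))V, as opposed to the fused FlashAttention kernel. A minimal NumPy sketch of that reference computation (the function name and shapes are illustrative, not from any particular codebase):

```python
import numpy as np

def reference_attention(q, k, v):
    """Standard (non-flash) attention: softmax(Q K^T / sqrt(d)) V,
    materializing the full attention matrix."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Subtract the row max for numerical stability before exponentiating.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = reference_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Comparing a model's output with this kind of reference path (e.g. via a `use_flash_attn=False` switch) is a quick way to tell whether a discrepancy comes from the fused kernel or from elsewhere in the model.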

The key difficulty is someone needs to implement it :D

You should:
- Install torch 2.4 (if that's the version you want)
- Install flash-attn (latest version 2.7.0.post2 should work)
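As a concrete sketch of those two steps (exact version pins are assumptions; a CUDA-enabled environment is required for flash-attn):

```shell
# Install the desired torch version first, then flash-attn.
pip install torch==2.4.0
# --no-build-isolation lets the flash-attn build see the installed torch.
pip install flash-attn==2.7.0.post2 --no-build-isolation
```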

Idk tbh. The result was wrong without NoSwizzle.

Ha this is great to know, thank you!