Tri Dao
I'm very curious about this. I think all of the values in `dq`, `dk`, `dv` should be overwritten during the execution of the backward pass. The only problematic scenario I could...
Does the same thing happen if you use the standard implementation of attention? i.e. try `use_flash_attn=False`
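If it helps narrow this down, here is a minimal sketch (not from the original thread) of how one might compare flash-attn gradients against a plain PyTorch reference attention; the shapes, the causal setting, and the tolerances printed at the end are assumptions for illustration:

```python
# Hypothetical sketch: compare dq/dk/dv from flash-attn against a fp32 reference.
import torch
from flash_attn import flash_attn_func

torch.manual_seed(0)
batch, seqlen, nheads, headdim = 2, 128, 4, 64
q, k, v = [
    torch.randn(batch, seqlen, nheads, headdim, device="cuda",
                dtype=torch.float16, requires_grad=True)
    for _ in range(3)
]
dout = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)

# Flash-attn forward/backward
out_flash = flash_attn_func(q, k, v, causal=True)
dq_f, dk_f, dv_f = torch.autograd.grad(out_flash, (q, k, v), dout)

# Standard attention in fp32 as the reference
qf, kf, vf = [t.detach().float().requires_grad_() for t in (q, k, v)]
scores = torch.einsum("bshd,bthd->bhst", qf, kf) / headdim ** 0.5
mask = torch.triu(torch.ones(seqlen, seqlen, device="cuda", dtype=torch.bool), 1)
scores = scores.masked_fill(mask, float("-inf"))
out_ref = torch.einsum("bhst,bthd->bshd", scores.softmax(dim=-1), vf)
dq_r, dk_r, dv_r = torch.autograd.grad(out_ref, (qf, kf, vf), dout.float())

# Max absolute difference per gradient tensor
for name, a, b in [("dq", dq_f, dq_r), ("dk", dk_f, dk_r), ("dv", dv_f, dv_r)]:
    print(name, (a.float() - b).abs().max().item())
```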
Not yet for now
The key difficulty is that someone needs to implement it :D
You should:
- Install torch 2.4 (if that's the version you want)
- Install flash-attn (latest version 2.7.0.post2 should work)
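After installing, a quick sanity check like the following sketch (assuming a CUDA GPU is available; shapes are arbitrary) can confirm the two packages work together:

```python
# Hypothetical sanity check: print versions and run a tiny flash-attn forward pass.
import torch
import flash_attn
from flash_attn import flash_attn_func

print("torch:", torch.__version__)
print("flash-attn:", flash_attn.__version__)

q, k, v = [torch.randn(1, 64, 8, 64, device="cuda", dtype=torch.bfloat16)
           for _ in range(3)]
out = flash_attn_func(q, k, v, causal=True)
print("output shape:", out.shape)  # expect (1, 64, 8, 64)
```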
Yeah, I think the transformers version is the issue.
Can you create a PR?
Wow this is great work!
Idk tbh. The result was wrong without NoSwizzle.
Ha this is great to know, thank you!