Phil Wang
will definitely try this out this week, and if it pans out, abstract this into a framework so one can try guidance on signals other than the attention map
@jordiae i think SOTA for diffusion transformers would be [Muse](https://github.com/lucidrains/muse-maskgit-pytorch). i'll take a look at DiVAE this weekend, thanks!
> @jordiae i think SOTA for diffusion transformers would be [Muse](https://github.com/lucidrains/muse-maskgit-pytorch)
>
> i'll take a look at DiVAE this weekend, thanks!
>
> The main difference is that...
yea that is on them to fix
it is done https://github.com/lucidrains/x-transformers#flash-attention
@Espritdelescalier https://arxiv.org/abs/2211.14730
turns out you can actually go a bit faster: https://crfm.stanford.edu/2023/10/12/flashdecoding.html but it requires that you be one of the CUDA experts out there
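the core idea in flash decoding can be sketched outside CUDA. this is a minimal numpy sketch (assumption: single query vector, no masking) of splitting keys/values into chunks, attending to each chunk independently, and merging the partial results with a numerically stable log-sum-exp combine:

```python
import numpy as np

def attend_full(q, k, v):
    # reference: standard softmax attention for a single query vector
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v

def attend_split_kv(q, k, v, num_chunks = 4):
    # flash-decoding style: process key/value chunks independently,
    # keeping per-chunk (max, sum, weighted-value) statistics,
    # then merge them with a log-sum-exp correction at the end
    scale = 1 / np.sqrt(q.shape[-1])
    maxes, sums, outs = [], [], []
    for k_chunk, v_chunk in zip(np.array_split(k, num_chunks), np.array_split(v, num_chunks)):
        scores = (q @ k_chunk.T) * scale
        m = scores.max()
        w = np.exp(scores - m)
        maxes.append(m)
        sums.append(w.sum())
        outs.append(w @ v_chunk)
    m_global = max(maxes)
    correction = [np.exp(m - m_global) for m in maxes]
    denom = sum(s * c for s, c in zip(sums, correction))
    numer = sum(o * c for o, c in zip(outs, correction))
    return numer / denom
```

the chunked version is exactly equal to the full softmax, which is what lets the chunks run in parallel across the KV sequence length.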
anyways, closing this as caching of key/values has been implemented!
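for anyone curious what key/value caching buys you, here is a minimal sketch (assumption: single head, no projection weights, hypothetical `KVCache` name): each decoding step appends its key/value once and attends over the accumulated cache, instead of recomputing keys and values for the whole prefix every step:

```python
import numpy as np

class KVCache:
    # minimal key/value cache sketch: append each new step's key/value
    # so decoding attends over all past steps without recomputing them
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)   # (t, dim) keys seen so far
        V = np.stack(self.values) # (t, dim) values seen so far
        scores = (K @ q) / np.sqrt(len(q))
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V
```

this matches full causal attention at every position, while the per-step cost drops from recomputing the whole prefix to a single append plus one attention over the cache.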
@pfeatherstone if you are working with 1d sequences, the best approach would be https://github.com/lucidrains/x-transformers#dynamic-positional-bias, which is `O(n)`. the other alternative is ALiBi positional embedding, which needs only to be materialized...
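to illustrate the ALiBi alternative, here is a minimal numpy sketch (assumptions: symmetric distance as in the bidirectional variant, and the simple power-of-two slope schedule, which matches the paper only when the head count is a power of two): each head gets a fixed slope, and the bias is just the negative scaled distance between query and key positions, added to attention scores:

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    # ALiBi sketch: per-head geometric slopes (assumes num_heads is a
    # power of two, as in the original slope schedule)
    slopes = 2.0 ** -np.arange(1, num_heads + 1)
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    distance = np.abs(i - j)         # relative distance, materialized once
    # (heads, seq, seq) bias to add to attention scores before softmax
    return -distance[None, :, :] * slopes[:, None, None]
```

since the bias depends only on relative distance, it is materialized once and reused at every layer, which is what keeps the overhead small.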
@pfeatherstone which module are you using from this repository? you should be using the CUDA implementation from [here](https://github.com/hazyResearch/flash-attention)