Phil Wang


will definitely try this out this week, and if it pans out, abstract this into a framework so one can try guidance on signals other than the attention map

@jordiae i think SOTA for diffusion transformers would be [Muse](https://github.com/lucidrains/muse-maskgit-pytorch). i'll take a look at DiVAE this weekend, thanks!

> > @jordiae i think SOTA for diffusion transformers would be [Muse](https://github.com/lucidrains/muse-maskgit-pytorch)
> > i'll take a look at DiVAE this weekend, thanks!
> >
> > The main difference is that...

it is done https://github.com/lucidrains/x-transformers#flash-attention
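a minimal sketch of what turning it on looks like in x-transformers, going off the linked readme (pytorch 2.0+ assumed; the `attn_flash` flag name is the one the readme uses):

```python
import torch
from x_transformers import TransformerWrapper, Decoder

# flash attention here is routed through pytorch 2.0's fused scaled_dot_product_attention
model = TransformerWrapper(
    num_tokens = 20000,
    max_seq_len = 1024,
    attn_layers = Decoder(
        dim = 512,
        depth = 6,
        heads = 8,
        attn_flash = True  # flag name per the readme linked above
    )
)

x = torch.randint(0, 20000, (1, 1024))
logits = model(x)  # (1, 1024, 20000)
```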

turns out you can actually go a bit faster https://crfm.stanford.edu/2023/10/12/flashdecoding.html but it requires that you are one of the CUDA experts out there
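the core trick there is splitting the kv cache into chunks, attending to each chunk in parallel, then combining the partial results with their log-sum-exps. a toy pytorch reference of just that reduction (none of the actual CUDA-level parallelism), to show the math works out:

```python
import torch

def split_kv_decode_attention(q, k, v, num_splits = 4):
    # q: (d,) single decode-time query, k / v: (n, d) cached keys / values
    scale = q.shape[-1] ** -0.5
    outs, lses = [], []

    # in the real kernel each chunk is handled by its own block of threads, in parallel
    for k_chunk, v_chunk in zip(k.chunk(num_splits), v.chunk(num_splits)):
        scores = (k_chunk @ q) * scale                        # (chunk,)
        lses.append(torch.logsumexp(scores, dim = 0))         # partial softmax normalizer, in log space
        outs.append(torch.softmax(scores, dim = 0) @ v_chunk)

    # rescale each partial output by its chunk's share of the total normalizer
    weights = torch.softmax(torch.stack(lses), dim = 0)
    return (torch.stack(outs) * weights[:, None]).sum(dim = 0)

# agrees with naive single-pass attention
q, k, v = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
naive = torch.softmax((k @ q) * 64 ** -0.5, dim = 0) @ v
assert torch.allclose(split_kv_decode_attention(q, k, v), naive, atol = 1e-5)
```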

anyways, closing this as caching of key/values has been implemented!
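for anyone finding this later, the gist of the kv cache during decoding, as a plain pytorch toy (names and shapes are just for illustration, not the actual module in this repo):

```python
import torch

def decode_step(x_new, w_q, w_k, w_v, cache = None):
    # x_new: (1, d) embedding of only the newest token - older tokens are never re-run
    q = x_new @ w_q
    k_new, v_new = x_new @ w_k, x_new @ w_v

    # append this step's key / value to the cache instead of recomputing the whole prefix
    k = k_new if cache is None else torch.cat((cache['k'], k_new))
    v = v_new if cache is None else torch.cat((cache['v'], v_new))
    cache = dict(k = k, v = v)

    attn = torch.softmax((q @ k.t()) * q.shape[-1] ** -0.5, dim = -1)  # (1, t)
    return attn @ v, cache

d = 64
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
cache = None
for _ in range(8):  # one token per step, with the cache threaded through
    out, cache = decode_step(torch.randn(1, d), w_q, w_k, w_v, cache)
```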

@pfeatherstone if you are working with 1d sequences, the best approach would be https://github.com/lucidrains/x-transformers#dynamic-positional-bias, which is `O(n)`. the other alternative is ALiBi positional embedding, which needs only to be materialized...
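roughly why dynamic positional bias is `O(n)`: a tiny mlp only has to be run over the `2n - 1` distinct relative distances, and the result is then gathered into the full bias matrix. a sketch of the idea (not the repo's exact module; names are illustrative):

```python
import torch
from torch import nn

class DynamicPositionBias(nn.Module):
    # small mlp from relative distance -> per-head attention bias
    def __init__(self, dim, heads):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, heads))

    def forward(self, seq_len):
        # only 2 * seq_len - 1 distinct relative distances go through the mlp: O(n)
        rel = torch.arange(-(seq_len - 1), seq_len).float()[:, None]   # (2n - 1, 1)
        bias = self.mlp(rel)                                           # (2n - 1, heads)

        # gather the per-distance biases into the full matrix added to the attention logits
        i, j = torch.meshgrid(torch.arange(seq_len), torch.arange(seq_len), indexing = 'ij')
        return bias[(j - i) + seq_len - 1].permute(2, 0, 1)            # (heads, n, n)

bias = DynamicPositionBias(dim = 64, heads = 8)(seq_len = 1024)  # (8, 1024, 1024)
```

ALiBi is the same shape of idea, except the per-head bias is just a fixed linear penalty on distance, so there is no mlp to run at all.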

@pfeatherstone which module are you using from this repository? you should be using the CUDA implementation from [here](https://github.com/hazyResearch/flash-attention)