Phil Wang
will definitely try this out this week, and if it pans out, abstract this into a framework so one can try guidance on signals other than the attention map
@jordiae i think SOTA for diffusion transformers would be [Muse](https://github.com/lucidrains/muse-maskgit-pytorch). i'll take a look at DiVAE this weekend, thanks!
> @jordiae i think SOTA for diffusion transformers would be [Muse](https://github.com/lucidrains/muse-maskgit-pytorch)
>
> i'll take a look at DiVAE this weekend, thanks!
>
> The main difference is that...
yea that is on them to fix
it is done https://github.com/lucidrains/x-transformers#flash-attention
@Espritdelescalier https://arxiv.org/abs/2211.14730
turns out you can actually go a bit faster: https://crfm.stanford.edu/2023/10/12/flashdecoding.html but it requires that you be one of the CUDA experts out there
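the core idea in flash decoding can be sketched outside CUDA. this is a minimal numpy sketch (assumption: single query vector, no masking) of splitting keys/values into chunks, attending to each chunk independently, and merging the partial results with a numerically stable log-sum-exp combine:

```python
import numpy as np

def attend_full(q, k, v):
    # reference: standard softmax attention for a single query vector
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v

def attend_split_kv(q, k, v, num_chunks = 4):
    # flash-decoding style: process key/value chunks independently,
    # keeping per-chunk (max, sum, weighted-value) statistics,
    # then merge them with a log-sum-exp correction at the end
    scale = 1 / np.sqrt(q.shape[-1])
    maxes, sums, outs = [], [], []
    for k_chunk, v_chunk in zip(np.array_split(k, num_chunks), np.array_split(v, num_chunks)):
        scores = (q @ k_chunk.T) * scale
        m = scores.max()
        w = np.exp(scores - m)
        maxes.append(m)
        sums.append(w.sum())
        outs.append(w @ v_chunk)
    m_global = max(maxes)
    correction = [np.exp(m - m_global) for m in maxes]
    denom = sum(s * c for s, c in zip(sums, correction))
    numer = sum(o * c for o, c in zip(outs, correction))
    return numer / denom
```

the chunked version is exactly equal to the full softmax, which is what lets the chunks run in parallel across the KV sequence length.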
anyways, closing this as caching of key/values has been implemented!
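for anyone curious what key/value caching buys you, here is a minimal sketch (assumption: single head, no projection weights, hypothetical `KVCache` name): each decoding step appends its key/value once and attends over the accumulated cache, instead of recomputing keys and values for the whole prefix every step:

```python
import numpy as np

class KVCache:
    # minimal key/value cache sketch: append each new step's key/value
    # so decoding attends over all past steps without recomputing them
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        K = np.stack(self.keys)   # (t, dim) keys seen so far
        V = np.stack(self.values) # (t, dim) values seen so far
        scores = (K @ q) / np.sqrt(len(q))
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V
```

this matches full causal attention at every position, while the per-step cost drops from recomputing the whole prefix to a single append plus one attention over the cache.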
@pfeatherstone if you are working with 1d sequences, the best approach would be https://github.com/lucidrains/x-transformers#dynamic-positional-bias, which is `O(n)`. the other alternative is ALiBi positional embedding, which needs only to be materialized...
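to illustrate the ALiBi alternative, here is a minimal numpy sketch (assumptions: symmetric distance as in the bidirectional variant, and the simple power-of-two slope schedule, which matches the paper only when the head count is a power of two): each head gets a fixed slope, and the bias is just the negative scaled distance between query and key positions, added to attention scores:

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    # ALiBi sketch: per-head geometric slopes (assumes num_heads is a
    # power of two, as in the original slope schedule)
    slopes = 2.0 ** -np.arange(1, num_heads + 1)
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    distance = np.abs(i - j)         # relative distance, materialized once
    # (heads, seq, seq) bias to add to attention scores before softmax
    return -distance[None, :, :] * slopes[:, None, None]
```

since the bias depends only on relative distance, it is materialized once and reused at every layer, which is what keeps the overhead small.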
@pfeatherstone which module are you using from this repository? you should be using the CUDA implementation from [here](https://github.com/hazyResearch/flash-attention)