memorizing-transformers-pytorch
Maybe scale is wrong
https://github.com/lucidrains/memorizing-transformers-pytorch/blob/83fa1479d6f7881dd977fbff55681e709e3b250e/memorizing_transformers_pytorch/memorizing_transformers_pytorch.py#L237
Shouldn't this be (1-scale)?
Ohh no, that is actually the learned temperature from a variant of attention (cosine similarity attention): https://github.com/lucidrains/x-transformers#query-key-normalization. The temperature is kept in log space and exponentiated here: https://github.com/lucidrains/memorizing-transformers-pytorch/blob/main/memorizing_transformers_pytorch/memorizing_transformers_pytorch.py#L235
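For anyone landing here, a minimal sketch of that idea (the class name `CosineSimAttention` and argument `init_temperature` are illustrative, not the repo's exact code): queries and keys are l2-normalized, so the usual `1/sqrt(d)` scale is replaced by a learned temperature stored in log space and exponentiated before use.

```python
import math
import torch
from torch import nn
import torch.nn.functional as F

class CosineSimAttention(nn.Module):
    # sketch only: l2-normalized q/k, learned temperature kept in log space
    def __init__(self, init_temperature = 10.):
        super().__init__()
        # stored in log space; exp() below keeps the effective scale positive
        self.scale = nn.Parameter(torch.full((1,), math.log(init_temperature)))

    def forward(self, q, k, v):
        # q, k, v: (batch, seq, dim_head)
        q, k = map(lambda t: F.normalize(t, dim = -1), (q, k))
        sim = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale.exp()
        attn = sim.softmax(dim = -1)
        return torch.einsum('b i j, b j d -> b i d', attn, v)
```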
@denadai2 ohh, if you were looking for the sigmoid gating, I removed it, since it was not working well for me and another researcher (I thought that was one of the weak parts of the paper). I went with the other researcher's suggestion of attending across the similarities, local and distant (a single softmax across the concatenated attention logits), as sketched below.
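A minimal sketch of what "softmax across the concatenated attention logits" can look like (the helper name `attend_local_and_distant` and the shapes are assumptions for illustration, not the repo's exact code): the local logits and the logits against the retrieved memories go through one softmax together, so local and distant positions compete directly rather than being mixed by a learned sigmoid gate.

```python
import torch

def attend_local_and_distant(sim_local, sim_mem, v_local, v_mem):
    # sim_local: (b, h, i, n)    logits against local keys
    # sim_mem:   (b, h, i, k)    logits against the k retrieved memory keys (per query)
    # v_local:   (b, h, n, d)    local values
    # v_mem:     (b, h, i, k, d) retrieved memory values (per query)

    # one softmax over the concatenated logits, in place of a sigmoid gate
    sim = torch.cat((sim_mem, sim_local), dim = -1)
    attn = sim.softmax(dim = -1)

    attn_mem, attn_local = attn[..., :sim_mem.shape[-1]], attn[..., sim_mem.shape[-1]:]

    out_local = torch.einsum('b h i n, b h n d -> b h i d', attn_local, v_local)
    out_mem   = torch.einsum('b h i k, b h i k d -> b h i d', attn_mem, v_mem)
    return out_local + out_mem
```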
Thanks for the prompt answer! I see it now :)
Btw, I'd say this increases the complexity... it makes sense though.