memorizing-transformers-pytorch
Maybe scale is wrong
https://github.com/lucidrains/memorizing-transformers-pytorch/blob/83fa1479d6f7881dd977fbff55681e709e3b250e/memorizing_transformers_pytorch/memorizing_transformers_pytorch.py#L237
Shouldn't this be (1-scale)?
Ohh no, that is actually the learned temperature from a variant of attention (cosine similarity attention): https://github.com/lucidrains/x-transformers#query-key-normalization. The temperature is kept in log space and exponentiated here: https://github.com/lucidrains/memorizing-transformers-pytorch/blob/main/memorizing_transformers_pytorch/memorizing_transformers_pytorch.py#L235
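For anyone landing here, a minimal sketch of that idea (the class name `CosineSimAttention` and argument `init_temperature` are illustrative, not the repo's exact code): queries and keys are l2-normalized, so the usual `1/sqrt(d)` scale is replaced by a learned temperature stored in log space and exponentiated before use.

```python
import math
import torch
from torch import nn
import torch.nn.functional as F

class CosineSimAttention(nn.Module):
    # sketch only: l2-normalized q/k, learned temperature kept in log space
    def __init__(self, init_temperature = 10.):
        super().__init__()
        # stored in log space; exp() below keeps the effective scale positive
        self.scale = nn.Parameter(torch.full((1,), math.log(init_temperature)))

    def forward(self, q, k, v):
        # q, k, v: (batch, seq, dim_head)
        q, k = map(lambda t: F.normalize(t, dim = -1), (q, k))
        sim = torch.einsum('b i d, b j d -> b i j', q, k) * self.scale.exp()
        attn = sim.softmax(dim = -1)
        return torch.einsum('b i j, b j d -> b i d', attn, v)
```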
@denadai2 ohh, if you were looking for the sigmoid gating, I removed it, since it was not working well for me and another researcher (I thought that was one of the weak parts of the paper). I went with the other researcher's suggestion of attending across the similarities, local and distant (a single softmax across the concatenated attention logits), as sketched below.
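A minimal sketch of what "softmax across the concatenated attention logits" can look like (the helper name `attend_local_and_distant` and the shapes are assumptions for illustration, not the repo's exact code): the local logits and the logits against the retrieved memories go through one softmax together, so local and distant positions compete directly rather than being mixed by a learned sigmoid gate.

```python
import torch

def attend_local_and_distant(sim_local, sim_mem, v_local, v_mem):
    # sim_local: (b, h, i, n)    logits against local keys
    # sim_mem:   (b, h, i, k)    logits against the k retrieved memory keys (per query)
    # v_local:   (b, h, n, d)    local values
    # v_mem:     (b, h, i, k, d) retrieved memory values (per query)

    # one softmax over the concatenated logits, in place of a sigmoid gate
    sim = torch.cat((sim_mem, sim_local), dim = -1)
    attn = sim.softmax(dim = -1)

    attn_mem, attn_local = attn[..., :sim_mem.shape[-1]], attn[..., sim_mem.shape[-1]:]

    out_local = torch.einsum('b h i n, b h n d -> b h i d', attn_local, v_local)
    out_mem   = torch.einsum('b h i k, b h i k d -> b h i d', attn_mem, v_mem)
    return out_local + out_mem
```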
Thanks for the prompt answer! I see it now :)
Btw, I'd say this increases the complexity... it makes sense though.