Quartic Transformer (wip)

Exploring an idea where one forgets about efficiency and carries out attention across every edge between nodes (tokens). You can think of it as doing attention on the attention matrix itself, viewing the attention matrix as the set of all directed edges of a fully connected graph.
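
To make this concrete, below is a minimal, self-contained sketch of the idea (illustrative names and shapes, not this repository's actual code): treat the n² directed edges as a sequence and run full attention over it, which is what makes the cost quartic in sequence length.

import torch
import torch.nn.functional as F

# illustrative dimensions only
b, n, dim, dim_edges = 1, 16, 64, 32

x = torch.randn(b, n, dim)  # token (node) embeddings

# ordinary pairwise attention logits: one scalar per directed edge i -> j
wq, wk = torch.randn(dim, dim), torch.randn(dim, dim)
logits = torch.einsum('bid,bjd->bij', x @ wq, x @ wk)  # (b, n, n)

# reinterpret the n^2 directed edges of the complete graph as a sequence
edges = logits.reshape(b, n * n, 1)

# project each edge to dim_edges and attend across all edges;
# full attention over n^2 "tokens" costs O((n^2)^2) = O(n^4)
to_qkv = torch.randn(1, dim_edges * 3)
eq, ek, ev = (edges @ to_qkv).chunk(3, dim = -1)
attn = F.softmax(eq @ ek.transpose(-1, -2) / dim_edges ** 0.5, dim = -1)
edge_out = (attn @ ev).reshape(b, n, n, dim_edges)  # per-edge features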

The hypothesis is that there exists some task that a (sub)quartic transformer can solve but quadratic transformers cannot.

Will also contain a modified implementation of the multi-stream transformer (which is not quartic, but quadratic times the number of streams).
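
For intuition, here is a rough sketch of what exchanging information between streams at the attention matrix could look like, in the spirit of talking-heads attention (Shazeer et al., cited below): mix the pre-softmax logits across the stream dimension with a learned linear map. All names and shapes are illustrative, not this repository's code.

import torch

b, streams, n, d = 1, 4, 16, 64

q, k, v = (torch.randn(b, streams, n, d) for _ in range(3))

# per-stream attention logits
logits = torch.einsum('bsid,bsjd->bsij', q, k) / d ** 0.5

# exchange information between streams at the attention matrix:
# a learned linear mix over the stream dimension, before the softmax
mix = torch.randn(streams, streams)
logits = torch.einsum('bsij,st->btij', logits, mix)

attn = logits.softmax(dim = -1)
out = torch.einsum('bsij,bsjd->bsid', attn, v)  # (b, streams, n, d)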

Install

$ pip install quartic-transformer

Usage

import torch
from quartic_transformer import QuarticTransformer

model = QuarticTransformer(
    num_tokens = 256,   # vocabulary size
    depth = 2,          # number of layers
    dim = 512,          # token (node) dimension
    dim_edges = 32      # edge feature dimension
)

tokens = torch.randint(0, 256, (1, 128))

logits = model(tokens) # (1, 128, 256)

Todo

  • [x] first add a weak taylor linear attention on top of all edges (see the sketch after this list)

  • [ ] use coordinate descent routing from the node attention matrix to select a subset of edges to update (and do full attention across)

  • [x] build multi-stream transformer, but allow exchange of information at the attention matrix, either through residual attention or a small edge-wise feedforward
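
As a rough illustration of the first item above (helper names are hypothetical, not this repository's code, and assuming the common second-order form of Taylor linear attention): a second-order Taylor feature map approximates exp(q · k), so attention over the flattened edge sequence costs time linear in the number of edges, O(n²) overall, rather than quartic.

import torch

def taylor_feature_map(x):
    # phi(x) = [1, x, vec(x x^T) / sqrt(2)], so that
    # phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2, the 2nd-order
    # Taylor expansion of exp(q.k); each term is always positive
    ones = torch.ones(*x.shape[:-1], 1)
    second = torch.einsum('...i,...j->...ij', x, x).flatten(-2) / 2 ** 0.5
    return torch.cat((ones, x, second), dim = -1)

def taylor_linear_attention(q, k, v):
    # q, k, v: (batch, seq, dim) -- cost is linear in seq length
    q, k = map(taylor_feature_map, (q, k))
    kv = torch.einsum('bnd,bne->bde', k, v)  # aggregate keys and values once
    z = k.sum(dim = 1)                       # softmax-style normalizer
    num = torch.einsum('bnd,bde->bne', q, kv)
    den = torch.einsum('bnd,bd->bn', q, z)   # strictly positive by construction
    return num / den.unsqueeze(-1)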

Citation

@inproceedings{Keles2022OnTC,
    title     = {On The Computational Complexity of Self-Attention},
    author    = {Feyza Duman Keles and Pruthuvi Maheshakya Wijewardena and Chinmay Hegde},
    booktitle = {International Conference on Algorithmic Learning Theory},
    year      = {2022},
    url       = {https://api.semanticscholar.org/CorpusID:252198880}
}

@article{Burtsev2021MultiStreamT,
    title   = {Multi-Stream Transformers},
    author  = {Mikhail S. Burtsev and Anna Rumshisky},
    journal = {ArXiv},
    year    = {2021},
    volume  = {abs/2107.10342},
    url     = {https://api.semanticscholar.org/CorpusID:236171087}
}

@misc{Sutton,
    title  = {The Bitter Lesson},
    author = {Rich Sutton},
    url    = {http://www.incompleteideas.net/IncIdeas/BitterLesson.html}
}

@article{Shazeer2020TalkingHeadsA,
    title   = {Talking-Heads Attention},
    author  = {Noam M. Shazeer and Zhenzhong Lan and Youlong Cheng and Nan Ding and Le Hou},
    journal = {ArXiv},
    year    = {2020},
    volume  = {abs/2003.02436},
    url     = {https://api.semanticscholar.org/CorpusID:212414717}
}