Something confused me about Nystromformer
System Info
Any GPU machine with transformers 4.26.0
Who can help?
@ArthurZucker @younesbelkada @sgugger @novice03
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The parameter `segment_means_seq_len` in the Nystromformer config is set to 64, which is equal to another parameter, `num_landmarks` (64) (refer to code).
But if they are equal, Nystromformer performs O(n^2) attention like BERT, not the Nyström attention proposed in the original paper (https://arxiv.org/abs/2102.03902) (refer to code).
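For context, the branch in question behaves roughly like the sketch below. This is a simplified paraphrase for illustration, not the exact code in `modeling_nystromformer.py`: the function name is made up, the attention mask and the output convolution are omitted, and `torch.linalg.pinv` stands in for the iterative pseudo-inverse approximation used in practice.

```python
import math
import torch

def attention_sketch(query, key, value, num_landmarks):
    """Simplified sketch of the branching logic described above (not the library code)."""
    batch, heads, seq_len, head_size = query.shape

    if num_landmarks == seq_len:
        # Fallback: full O(n^2) softmax attention, i.e. BERT-style.
        scores = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(head_size)
        return torch.matmul(torch.softmax(scores, dim=-1), value)

    # Nystrom path: landmarks are segment means over blocks of
    # seq_len // num_landmarks tokens (seq_len must be divisible by num_landmarks).
    q_landmarks = query.reshape(batch, heads, num_landmarks, seq_len // num_landmarks, head_size).mean(dim=-2)
    k_landmarks = key.reshape(batch, heads, num_landmarks, seq_len // num_landmarks, head_size).mean(dim=-2)

    kernel_1 = torch.softmax(torch.matmul(query, k_landmarks.transpose(-1, -2)) / math.sqrt(head_size), dim=-1)
    kernel_2 = torch.softmax(torch.matmul(q_landmarks, k_landmarks.transpose(-1, -2)) / math.sqrt(head_size), dim=-1)
    kernel_3 = torch.softmax(torch.matmul(q_landmarks, key.transpose(-1, -2)) / math.sqrt(head_size), dim=-1)

    # Nystrom approximation of softmax(QK^T/sqrt(d))V; pinv used here only for brevity.
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ (kernel_3 @ value)
```

With the default config, `num_landmarks == seq_len == 64`, so only the first branch is ever taken.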
Through experimentation and analysis, I think the parameter `segment_means_seq_len` should be the length of the tokenized input sequence. It should not be set to 64: setting it to 64 means you want O(n^2) attention, not Nyström attention.
So, is there a problem with the code, or is my understanding wrong? Additionally, was the checkpoint w-madison/nystromformer-5 trained with O(n^2) attention? If so, will it not run with Nyström attention, i.e. does it need to be pretrained with Nyström attention?
Expected behavior
The parameter `segment_means_seq_len` should be set to the real tokenized sequence length, so that Nyström attention is used for training and inference.
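As a minimal sketch of the expected usage (illustrative values, freshly initialized model, inputs padded to exactly `seq_len` tokens):

```python
import torch
from transformers import NystromformerConfig, NystromformerModel

# The input length should be a multiple of num_landmarks and should match
# segment_means_seq_len so the Nystrom branch is actually taken.
seq_len = 256
config = NystromformerConfig(
    segment_means_seq_len=seq_len,  # the real (padded) input length, not the default 64
    num_landmarks=64,
)
model = NystromformerModel(config)

# Inputs padded/truncated to exactly seq_len tokens.
input_ids = torch.randint(0, config.vocab_size, (1, seq_len))
outputs = model(input_ids=input_ids)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 256, 768])
```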
I'll ping @novice03 here as he's an expert on Nyströmformer
Hello @1649759610, thank you for making this post. It looks like this is indeed an issue in the code that I might have overlooked. You are correct that `segment_means_seq_len` should be set to the length of the input. If I were to fix it, I would just remove the `segment_means_seq_len` parameter and set `self.seq_len` in the model to `config.max_position_embeddings`. I think a pull request would have to be made to make these changes. I am also guessing that the tests need to be changed accordingly.
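The relevant part of the change might look roughly like this (a hypothetical sketch of the suggestion, not an actual patch):

```python
import torch.nn as nn
from transformers import NystromformerConfig

class PatchedSelfAttentionInit(nn.Module):
    """Hypothetical sketch of the suggested fix, not the actual patch."""

    def __init__(self, config: NystromformerConfig):
        super().__init__()
        self.num_landmarks = config.num_landmarks
        # Before: self.seq_len = config.segment_means_seq_len  (fixed at 64)
        # After: derive the sequence length from the position-embedding budget
        # and drop segment_means_seq_len from the config entirely.
        self.seq_len = config.max_position_embeddings
```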
However, regarding the checkpoints, they were trained with Nystrom attention and not O(n^2) attention. This is just an issue in the HuggingFace implementation. So, they need not be re-trained.
@1649759610 Would you like to make a PR about this?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.