Something confused me about Nystromformer
System Info
Any GPU machine with transformers 4.26.0
Who can help?
@ArthurZucker @younesbelkada @sgugger @novice03
Information
- [x] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
The parameter `segment_means_seq_len` in the Nystromformer config is set to 64, which is equal to another parameter, `num_landmarks` (64) (refer to code).
But if they are equal, Nystromformer performs O(n^2) attention like BERT, not the Nyström attention proposed in the original paper (https://arxiv.org/abs/2102.03902) (refer to code).
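For context, the branch in question behaves roughly like the sketch below. This is a simplified paraphrase for illustration, not the exact code in `modeling_nystromformer.py`: the function name is made up, the attention mask and the output convolution are omitted, and `torch.linalg.pinv` stands in for the iterative pseudo-inverse approximation used in practice.

```python
import math
import torch

def attention_sketch(query, key, value, num_landmarks):
    """Simplified sketch of the branching logic described above (not the library code)."""
    batch, heads, seq_len, head_size = query.shape

    if num_landmarks == seq_len:
        # Fallback: full O(n^2) softmax attention, i.e. BERT-style.
        scores = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(head_size)
        return torch.matmul(torch.softmax(scores, dim=-1), value)

    # Nystrom path: landmarks are segment means over blocks of
    # seq_len // num_landmarks tokens (seq_len must be divisible by num_landmarks).
    q_landmarks = query.reshape(batch, heads, num_landmarks, seq_len // num_landmarks, head_size).mean(dim=-2)
    k_landmarks = key.reshape(batch, heads, num_landmarks, seq_len // num_landmarks, head_size).mean(dim=-2)

    kernel_1 = torch.softmax(torch.matmul(query, k_landmarks.transpose(-1, -2)) / math.sqrt(head_size), dim=-1)
    kernel_2 = torch.softmax(torch.matmul(q_landmarks, k_landmarks.transpose(-1, -2)) / math.sqrt(head_size), dim=-1)
    kernel_3 = torch.softmax(torch.matmul(q_landmarks, key.transpose(-1, -2)) / math.sqrt(head_size), dim=-1)

    # Nystrom approximation of softmax(QK^T/sqrt(d))V; pinv used here only for brevity.
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ (kernel_3 @ value)
```

With the default config, `num_landmarks == seq_len == 64`, so only the first branch is ever taken.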
Through experimentation and analysis, I think the parameter `segment_means_seq_len` should be the length of the tokenized input sequence. It should not be set to 64: setting it to 64 means you want O(n^2) attention, not Nyström attention.
So, is there a problem with the code, or is my understanding wrong? Additionally, was the checkpoint w-madison/nystromformer-5 trained with O(n^2) attention? If so, will it not run with Nyström attention, i.e. does it need to be pretrained with Nyström attention?
Expected behavior
The parameter `segment_means_seq_len` should be set to the real tokenized sequence length, so that Nyström attention is used for training and inference.
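As a minimal sketch of the expected usage (illustrative values, freshly initialized model, inputs padded to exactly `seq_len` tokens):

```python
import torch
from transformers import NystromformerConfig, NystromformerModel

# The input length should be a multiple of num_landmarks and should match
# segment_means_seq_len so the Nystrom branch is actually taken.
seq_len = 256
config = NystromformerConfig(
    segment_means_seq_len=seq_len,  # the real (padded) input length, not the default 64
    num_landmarks=64,
)
model = NystromformerModel(config)

# Inputs padded/truncated to exactly seq_len tokens.
input_ids = torch.randint(0, config.vocab_size, (1, seq_len))
outputs = model(input_ids=input_ids)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 256, 768])
```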
I'll ping @novice03 here as he's an expert on Nyströmformer
Hello @1649759610, thank you for making this post. It looks like this is indeed an issue in the code that I might have overlooked. You are correct that `segment_means_seq_len` should be set to the length of the input. If I were to fix it, I would just remove the `segment_means_seq_len` parameter and set `self.seq_len` in the model to `config.max_position_embeddings`. I think a pull request would have to be made to make these changes. I am also guessing that the tests need to be changed accordingly.
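The relevant part of the change might look roughly like this (a hypothetical sketch of the suggestion, not an actual patch):

```python
import torch.nn as nn
from transformers import NystromformerConfig

class PatchedSelfAttentionInit(nn.Module):
    """Hypothetical sketch of the suggested fix, not the actual patch."""

    def __init__(self, config: NystromformerConfig):
        super().__init__()
        self.num_landmarks = config.num_landmarks
        # Before: self.seq_len = config.segment_means_seq_len  (fixed at 64)
        # After: derive the sequence length from the position-embedding budget
        # and drop segment_means_seq_len from the config entirely.
        self.seq_len = config.max_position_embeddings
```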
However, regarding the checkpoints, they were trained with Nystrom attention and not O(n^2) attention. This is just an issue in the HuggingFace implementation. So, they need not be re-trained.
@1649759610 Would you like to make a PR about this?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.