Differences between ETC and BigBird-ETC version
@manzilz Thank you for sharing the excellent research. :)
I have two quick questions. If I missed something in the paper, could you please point me to the relevant section?
Q1. Is the global-local attention used in the BigBird-ETC version exactly the same as in the ETC paper, or is it closer to Longformer's?
As I understand the ETC paper, the special (global) tokens attend fully only to restricted spans. For example, in the HotpotQA task, a paragraph token attends to all tokens within its paragraph, and a sentence token attends to all tokens within its sentence. (I couldn't find how the [CLS] and question tokens attend.)
In Longformer, by contrast, the special tokens placed between sentences attend to the entire context.
In the BigBird paper (just above Section 3), the authors say:
"we add g global tokens that attend to all existing tokens."
This suggests that the BigBird-ETC version is similar to Longformer. However, when the paper discusses the differences between Longformer and BigBird-ETC (in Appendix E.3), it points to ETC as the reference, which confuses me.
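
To make sure I'm asking about the right thing, here is a rough sketch of the two global-attention patterns I have in mind. This is purely my own illustration (not taken from the paper or this repo), and the helper names `longformer_style_global_mask` / `etc_style_global_mask` are hypothetical:

```python
import numpy as np

def longformer_style_global_mask(seq_len, global_idx):
    """Longformer-style: global tokens attend to, and are attended by, every token."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[global_idx, :] = True   # global tokens attend to all tokens
    mask[:, global_idx] = True   # all tokens attend back to the global tokens
    return mask

def etc_style_global_mask(seq_len, segments):
    """ETC-style (as I read it): each auxiliary token attends only to its own span.

    `segments` maps a global token index to the (start, end) span it summarizes,
    e.g. a sentence token and the tokens of that sentence.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for g, (start, end) in segments.items():
        mask[g, start:end] = True   # sentence/paragraph token attends within its span
        mask[start:end, g] = True   # tokens in the span attend back to their global token
    return mask

# toy example: 10 tokens, token 0 is a global token summarizing tokens 1-4
print(longformer_style_global_mask(10, [0]).sum())     # many more connections
print(etc_style_global_mask(10, {0: (1, 5)}).sum())    # restricted connections
```

My question is essentially which of these two patterns the BigBird-ETC global tokens follow.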
Q2. Is there source code or a pre-trained model available for the BigBird-ETC version? If you could share what was used in the paper, I would really appreciate it!
I look forward to your response.