
Why choose two segments from different contexts during pre-training?

Dzautriet opened this issue 5 years ago • 4 comments

Hi,

I have a question regarding the calculation of self-attention.

In the paper, you state that prior to pre-training, "following BERT, we randomly sample two segments (either from the same context or not) and treat the concatenation of two segments as one sequence to perform permutation language modelling. We only reuse the memory that belongs to the same context." The input takes the form [A, SEP, B, SEP, CLS].
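
To make sure I understand, here is roughly how I picture that sampling step. This is just a sketch, not the repo's actual data pipeline; the token ids and the helper name are made up:

```python
import random

# Hypothetical special-token ids, for illustration only (not the repo's real values).
SEP_ID = 4
CLS_ID = 3

def build_pretrain_sequence(context, all_contexts, seg_len):
    """Rough sketch of the two-segment sampling described in the paper.

    `context` and the elements of `all_contexts` are lists of token ids.
    """
    # Segment A always comes from the current context.
    seg_a = context[:seg_len]

    # 50% of the time, segment B is drawn from a *different* context.
    if random.random() < 0.5 and len(all_contexts) > 1:
        other = random.choice([c for c in all_contexts if c is not context])
        seg_b = other[:seg_len]
        same_context = False
    else:
        seg_b = context[seg_len:2 * seg_len]
        same_context = True

    # Concatenate as [A, SEP, B, SEP, CLS]; the cached memory is only
    # reused for the segment that belongs to the same context.
    input_ids = seg_a + [SEP_ID] + seg_b + [SEP_ID] + [CLS_ID]
    return input_ids, same_context
```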

It seems to me that self-attention is calculated within one sequence whether segment A and segment B are from the same context or not. I can't figure out why you choose A and B from different contexts (50% of the time) if there is no NSP during pre-training. Will this potentially degrade the language model's performance at predicting the target tokens?
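
To illustrate what I mean, here is a toy single-head attention over the concatenated sequence (ignoring the permutation mask and the two-stream mechanism, and with made-up shapes). Nothing in it distinguishes tokens of A from tokens of B:

```python
import numpy as np

def toy_self_attention(x):
    """Toy single-head self-attention over the full [A, SEP, B, SEP, CLS] sequence.

    x: (seq_len, d_model) embeddings of the concatenated input. Every position
    can attend to every other position, regardless of which segment it came
    from; segment membership does not restrict attention here.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                      # (seq_len, seq_len), no segment mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ x

# e.g. 10 tokens of A + SEP + 10 tokens of B + SEP + CLS = 23 positions
out = toy_self_attention(np.random.randn(23, 8))
```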

Thank you!

Dzautriet avatar Jul 26 '19 10:07 Dzautriet

@kimiyoung Just commenting for visibility, I'm also really curious about this.

langfield avatar Jul 29 '19 02:07 langfield

Also confused about this question.

xingchensong avatar Jul 29 '19 05:07 xingchensong

Here's my take on the problem. While both segments share one set of in-sequence self-attention, only one of them gets to access the contextual embeddings stored in the memory, which was obtained from an earlier sequence in the same context. Apart from the obvious dropout-like effect on memory (and a possible generalization benefit), I can think of one use case where this feature may prove useful. Suppose you have two very long sequences over which you want to make coherent decisions, but both far exceed the sequence-length limit, so neither can be fed to the model in one pass. In that case, one could leverage the eXtra-Long memory and partial memory access (sketched in code after this list) to build two sets of contextual embeddings A and B that:

  1. somewhat represent those original long sequences, and at the same time
  2. attend to each other.
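
Below is a minimal sketch of what I mean by partial memory access. This is my own reading with made-up shapes, not the code in this repo: memory cached from an earlier chunk of A's context is visible to A's tokens, while B's tokens only see it when B comes from the same context.

```python
import numpy as np

def memory_attention_mask(mem_len, len_a, len_b, b_shares_context=False):
    """Sketch of partial access to the cached memory.

    Returns a boolean mask of shape (len_a + len_b, mem_len), where True means
    "this query position may attend to this cached memory position". The memory
    was produced from an earlier chunk of segment A's context, so only A's
    tokens (and B's, if B comes from the same context) get to reuse it.
    """
    mask = np.zeros((len_a + len_b, mem_len), dtype=bool)
    mask[:len_a, :] = True          # segment A always reuses the cached memory
    if b_shares_context:
        mask[len_a:, :] = True      # B reuses it only when it shares the context
    return mask

# In-sequence self-attention between A and B stays unrestricted either way;
# only access to the *cached* states differs.
print(memory_attention_mask(mem_len=4, len_a=3, len_b=3, b_shares_context=False))
```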

illuminascent avatar Aug 14 '19 14:08 illuminascent

I'm also confused about the same problem.

The only explanation I can think of is that, at first, the authors tried to do NSP during pre-training, but abandoned it as a training objective because the ablation study showed it wasn't worth it. Refer to Section 2.5 of the paper:

XLNet-Large does not use the objective of next sentence prediction as it does not show consistent improvement in our ablation study.

Still, I'm confused about two things:

(1) So, does that mean the authors used NSP for pre-training the XLNet-base model?

(2) If they abandoned NSP entirely, why do they still use the two-segment data format?

jihun-hong avatar Aug 23 '19 07:08 jihun-hong