Fangzhou Dong

5 comments

I think your calculation is correct. The original ALBERT was trained with batch size 4096 (as specified in their paper); that is the reason behind using LAMB and why it only...
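For context, LAMB's core idea is to rescale each layer's Adam-style update by a per-layer trust ratio, which is what keeps batch sizes like 4096 trainable. A minimal NumPy sketch of one layer's update step (simplified: bias correction omitted, names mine rather than the paper's notation):

```python
import numpy as np

def lamb_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One simplified LAMB update for a single layer's weights w, gradient g."""
    # Adam-style first and second moment estimates.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    update = m / (np.sqrt(v) + eps) + weight_decay * w
    # Layer-wise trust ratio ||w|| / ||update||: keeps the step size
    # proportional to the weight norm, which stabilizes very large batches.
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = w - lr * trust_ratio * update
    return w, m, v
```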

@008karan If you haven't done a full shuffle on your data -> yes. Otherwise, any subset of the training dataset should represent the whole set well enough, and it's perfectly fine...
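The point being: after a full shuffle, any slice of the corpus is an unbiased sample, whereas a prefix of unshuffled data reflects whatever order it was collected in. A toy illustration (purely synthetic data):

```python
import random

# Toy corpus; in practice these would be examples from the training set.
corpus = [f"example_{i}" for i in range(1_000_000)]

# Without shuffling, a prefix can be badly biased (e.g., sorted by source).
biased_subset = corpus[:10_000]

# After a full shuffle, any slice is a uniform random sample of the
# whole corpus, so training on a subset is statistically fine.
random.seed(0)
random.shuffle(corpus)
fair_subset = corpus[:10_000]
```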

FYI, I tried several ways to construct a sentence embedding given the text input and hidden outputs. They all turned out to be surprisingly similar in cosine similarity (just like the result...
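To make that concrete, these are the kinds of constructions I mean; a rough sketch with random tensors standing in for real hidden states (so the printed numbers here won't reproduce the effect, it just shows the constructions):

```python
import torch
import torch.nn.functional as F

# Pretend last-layer hidden states: (seq_len, hidden_dim).
torch.manual_seed(0)
hidden = torch.randn(12, 768)
mask = torch.ones(12)  # attention mask; all tokens real in this toy case

# Three common sentence-embedding constructions.
cls_embedding = hidden[0]                                      # first ([CLS]-style) token
mean_embedding = (hidden * mask[:, None]).sum(0) / mask.sum()  # mean pooling
max_embedding = hidden.max(dim=0).values                       # max pooling

# Compare them pairwise with cosine similarity.
for name, (a, b) in [("cls vs mean", (cls_embedding, mean_embedding)),
                     ("cls vs max", (cls_embedding, max_embedding)),
                     ("mean vs max", (mean_embedding, max_embedding))]:
    print(name, F.cosine_similarity(a, b, dim=0).item())
```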

You are getting this error because the assertion is not implemented properly. bsz in relative_positional_encoding is inferred from the shape of the input, which makes it a tensor. And %...
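A minimal sketch of the failure mode and two possible workarounds, assuming the TF1-style graph code that XLNet uses (the placeholder and names here are illustrative, not the actual model code):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # TF1-style graph mode, as in XLNet

# [seq_len, bsz] input with an unknown batch dimension.
inp = tf.compat.v1.placeholder(tf.int32, shape=[None, None])
bsz = tf.shape(inp)[1]  # inferred from the dynamic shape -> a Tensor, not an int

# So `bsz % 2 == 0` is itself a symbolic Tensor. `assert` calls bool() on it,
# which is what blows up during graph construction:
# assert bsz % 2 == 0  # TypeError: using a tf.Tensor as a Python bool ...

# Workaround 1: use the static shape when it is known at graph-build time.
static_bsz = inp.shape[1]  # a plain dimension (or None if truly unknown)

# Workaround 2: perform the check inside the graph instead of in Python.
check = tf.debugging.Assert(tf.equal(bsz % 2, 0), [bsz])
```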

Here's my take on the problem. While both segments share one set of in-sequence self-attention, only one of them gets to access the contextual embeddings stored inside the memory, which...
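Roughly what I mean, as a simplified single-head sketch in PyTorch (not the actual model code; shapes and names are made up): both segments attend within the shared sequence, but only one of them also gets the memory in its keys and values.

```python
import torch
import torch.nn.functional as F

def attend(q, k, v):
    """Plain single-head scaled dot-product attention."""
    scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
d = 64
mem = torch.randn(32, d)    # cached contextual embeddings from earlier steps
seg_a = torch.randn(16, d)  # segment that is allowed to read the memory
seg_b = torch.randn(16, d)  # segment that is not

seq = torch.cat([seg_a, seg_b], dim=0)  # both share in-sequence attention

# Segment A's keys/values include the memory; segment B's do not.
kv_a = torch.cat([mem, seq], dim=0)
out_a = attend(seg_a, kv_a, kv_a)
out_b = attend(seg_b, seq, seq)
```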