Fangzhou Dong

5 comments

I think your calculation is correct. The original ALBERT was trained with batch size 4096 (as specified in their paper); that is the reason behind using LAMB and why it only...
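For context, LAMB's core idea is to rescale each layer's Adam-style update by a per-layer trust ratio, which is what keeps batch sizes like 4096 trainable. A minimal NumPy sketch of one layer's update step (simplified: bias correction omitted, names mine rather than the paper's notation):

```python
import numpy as np

def lamb_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One simplified LAMB update for a single layer's weights w, gradient g."""
    # Adam-style first and second moment estimates.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    update = m / (np.sqrt(v) + eps) + weight_decay * w
    # Layer-wise trust ratio ||w|| / ||update||: keeps the step size
    # proportional to the weight norm, which stabilizes very large batches.
    w_norm = np.linalg.norm(w)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = w - lr * trust_ratio * update
    return w, m, v
```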

@008karan If you haven't done a full shuffle on your data -> yes. Otherwise, any subset of the training dataset should represent the whole set well enough, and it's perfectly fine...
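The point being: after a full shuffle, any slice of the corpus is an unbiased sample, whereas a prefix of unshuffled data reflects whatever order it was collected in. A toy illustration (purely synthetic data):

```python
import random

# Toy corpus; in practice these would be examples from the training set.
corpus = [f"example_{i}" for i in range(1_000_000)]

# Without shuffling, a prefix can be badly biased (e.g., sorted by source).
biased_subset = corpus[:10_000]

# After a full shuffle, any slice is a uniform random sample of the
# whole corpus, so training on a subset is statistically fine.
random.seed(0)
random.shuffle(corpus)
fair_subset = corpus[:10_000]
```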

FYI, I tried several ways to construct a sentence embedding given the text input and hidden outputs. They all turned out to be surprisingly similar in cosine similarity (just like the result...
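To make that concrete, these are the kinds of constructions I mean; a rough sketch with random tensors standing in for real hidden states (so the printed numbers here won't reproduce the effect, it just shows the constructions):

```python
import torch
import torch.nn.functional as F

# Pretend last-layer hidden states: (seq_len, hidden_dim).
torch.manual_seed(0)
hidden = torch.randn(12, 768)
mask = torch.ones(12)  # attention mask; all tokens real in this toy case

# Three common sentence-embedding constructions.
cls_embedding = hidden[0]                                      # first ([CLS]-style) token
mean_embedding = (hidden * mask[:, None]).sum(0) / mask.sum()  # mean pooling
max_embedding = hidden.max(dim=0).values                       # max pooling

# Compare them pairwise with cosine similarity.
for name, (a, b) in [("cls vs mean", (cls_embedding, mean_embedding)),
                     ("cls vs max", (cls_embedding, max_embedding)),
                     ("mean vs max", (mean_embedding, max_embedding))]:
    print(name, F.cosine_similarity(a, b, dim=0).item())
```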

You are getting this error because the assertion is not implemented properly. bsz in relative_positional_encoding is inferred from the shape of the input, which makes it a tensor. And %...
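A minimal sketch of the failure mode and two possible workarounds, assuming the TF1-style graph code that XLNet uses (the placeholder and names here are illustrative, not the actual model code):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # TF1-style graph mode, as in XLNet

# [seq_len, bsz] input with an unknown batch dimension.
inp = tf.compat.v1.placeholder(tf.int32, shape=[None, None])
bsz = tf.shape(inp)[1]  # inferred from the dynamic shape -> a Tensor, not an int

# So `bsz % 2 == 0` is itself a symbolic Tensor. `assert` calls bool() on it,
# which is what blows up during graph construction:
# assert bsz % 2 == 0  # TypeError: using a tf.Tensor as a Python bool ...

# Workaround 1: use the static shape when it is known at graph-build time.
static_bsz = inp.shape[1]  # a plain dimension (or None if truly unknown)

# Workaround 2: perform the check inside the graph instead of in Python.
check = tf.debugging.Assert(tf.equal(bsz % 2, 0), [bsz])
```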

Here's my take on the problem. While both segments share one set of in-sequence self-attention, only one of them gets to access the contextual embeddings stored inside the memory, which...
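Roughly what I mean, as a simplified single-head sketch in PyTorch (not the actual model code; shapes and names are made up): both segments attend within the shared sequence, but only one of them also gets the memory in its keys and values.

```python
import torch
import torch.nn.functional as F

def attend(q, k, v):
    """Plain single-head scaled dot-product attention."""
    scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
d = 64
mem = torch.randn(32, d)    # cached contextual embeddings from earlier steps
seg_a = torch.randn(16, d)  # segment that is allowed to read the memory
seg_b = torch.randn(16, d)  # segment that is not

seq = torch.cat([seg_a, seg_b], dim=0)  # both share in-sequence attention

# Segment A's keys/values include the memory; segment B's do not.
kv_a = torch.cat([mem, seq], dim=0)
out_a = attend(seg_a, kv_a, kv_a)
out_b = attend(seg_b, seq, seq)
```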