BERT-pytorch
When training the masked LM, are the unmasked words (which get label 0) trained together with the masked words?
According to the code
def random_word(self, sentence):
    tokens = sentence.split()
    output_label = []
    for i, token in enumerate(tokens):
        prob = random.random()
        if prob < 0.15:
            # rescale so the branches below split the 15% into 80/10/10
            prob /= 0.15
            # 80%: change token to the mask token
            if prob < 0.8:
                tokens[i] = self.vocab.mask_index
            # 10%: change token to a random token
            elif prob < 0.9:
                tokens[i] = random.randrange(len(self.vocab))
            # 10%: keep the current token
            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
            output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))
        else:
            tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
            output_label.append(0)
    return tokens, output_label
Do we need to exclude the unmasked words when training the LM?
@coddinglxf I just solved that problem with nn.NLLLoss(ignore_index=0), where 0 is equal to pad_index. Even if we target 0 (the unmasked value), it doesn't contribute to the loss or to the backward propagation.
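Something like this toy sketch shows the ignore_index behavior (the shapes and values are made up for illustration, not taken from the repo):

import torch
import torch.nn as nn

# 4 flattened positions, a vocab of size 5
log_probs = torch.log_softmax(torch.randn(4, 5, requires_grad=True), dim=-1)  # (positions, vocab)
targets = torch.tensor([0, 3, 0, 1])   # 0 = the label random_word gives to unmasked/pad positions

criterion = nn.NLLLoss(ignore_index=0)
loss = criterion(log_probs, targets)   # averaged over the two non-zero targets only
loss.backward()                        # positions labeled 0 contribute no gradient

One side effect is that token id 0 can never be a prediction target, which is fine as long as id 0 is reserved for padding.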
Got it, thanks. However, when the batch size is large, the final LM output will be batch_size * seq_len * vocab_size; such a matrix is too big. Maybe we can record the indices of the masked words when preprocessing the text and use indexing to save memory.
@coddinglxf that's what I thought at first, but I couldn't implement it as efficiently as the plain GPU computation. If you have any idea, please implement it and send a pull request :) It would be really cool to do it 👍
@codertimo the output of transformer block will be : (batch, seq, hidden)
step 1: we can reshape the transformer output as : (batch*seq, hidden)
step 2: index the masked words for 0-th sentence, the i-th masked word will have a index: 0 * seq + i for j-th sentence, the i-th masked word will have a index: j * seq + i
combine step1 and step2: we can get the hidden output of masked words (using fancy indexing)
the question is we can not do this with multi-gpu
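A rough single-GPU sketch of those two steps (the toy sizes, the fake masked positions, and the nn.Linear vocabulary projection are all assumptions for illustration, not code from this repo):

import torch
import torch.nn as nn

batch, seq, hidden_dim, vocab_size = 8, 16, 32, 100   # toy sizes

hidden = torch.randn(batch, seq, hidden_dim)    # transformer block output: (batch, seq, hidden)
to_vocab = nn.Linear(hidden_dim, vocab_size)    # projection to the vocabulary

# pretend every sentence has 2 masked positions; in practice these indices
# would be recorded during preprocessing
masked_i = torch.randint(0, seq, (batch, 2))                                   # (batch, n_masked)
flat_index = (torch.arange(batch).unsqueeze(1) * seq + masked_i).reshape(-1)   # j * seq + i

# step 1: reshape to (batch*seq, hidden); step 2: fancy-index only the masked rows
masked_hidden = hidden.reshape(batch * seq, hidden_dim)[flat_index]   # (total_masked, hidden)
masked_scores = to_vocab(masked_hidden)                               # (total_masked, vocab_size)

This way the big matrix scales with the number of masked tokens (about 15% of batch*seq) instead of the full batch_size * seq_len * vocab_size.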
Why is output_label 0 here, and not "tokens[i]"? If we set output_label = 0, does that mean only 15% of the training data is used to train the masked LM?
tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
output_label.append(0)
@leon-cas yes, your question is answered in #36.