BERT-pytorch
When training the masked LM, are the unmasked words (which get label 0) trained together with the masked words?
According to the code
def random_word(self, sentence):
    tokens = sentence.split()
    output_label = []
    for i, token in enumerate(tokens):
        prob = random.random()
        if prob < 0.15:
            # rescale so the branches below split the 15% into 80/10/10
            prob /= 0.15
            # 80%: change token to the mask token
            if prob < 0.8:
                tokens[i] = self.vocab.mask_index
            # 10%: change token to a random token
            elif prob < 0.9:
                tokens[i] = random.randrange(len(self.vocab))
            # 10%: keep the current token
            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
            output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))
        else:
            tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
            output_label.append(0)
    return tokens, output_label
Do we need to exclude the unmasked words when training the LM?
@coddinglxf I just solved that problem with nn.NLLLoss(ignore_index=0), where 0 is equal to pad_index. Even if we target 0 (the unmasked value), it doesn't contribute to the loss or to the backward propagation.
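Something like this toy sketch shows the ignore_index behavior (the shapes and values are made up for illustration, not taken from the repo):

import torch
import torch.nn as nn

# 4 flattened positions, a vocab of size 5
log_probs = torch.log_softmax(torch.randn(4, 5, requires_grad=True), dim=-1)  # (positions, vocab)
targets = torch.tensor([0, 3, 0, 1])   # 0 = the label random_word gives to unmasked/pad positions

criterion = nn.NLLLoss(ignore_index=0)
loss = criterion(log_probs, targets)   # averaged over the two non-zero targets only
loss.backward()                        # positions labeled 0 contribute no gradient

One side effect is that token id 0 can never be a prediction target, which is fine as long as id 0 is reserved for padding.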
Got it, thanks. However, when the batch size is large, the final LM output will be batch_size * seq_len * vocab_size; such a matrix is too big. Maybe we can record the indices of the masked words when preprocessing the text and use indexing to save memory.
@coddinglxf that's what I thought at first, but I couldn't implement it as efficiently as the plain GPU computation. If you have any idea, please implement it and send a pull request :) It would be really cool to do it 👍
@codertimo the output of transformer block will be : (batch, seq, hidden)
step 1: we can reshape the transformer output as : (batch*seq, hidden)
step 2: index the masked words for 0-th sentence, the i-th masked word will have a index: 0 * seq + i for j-th sentence, the i-th masked word will have a index: j * seq + i
combine step1 and step2: we can get the hidden output of masked words (using fancy indexing)
the question is we can not do this with multi-gpu
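A rough single-GPU sketch of those two steps (the toy sizes, the fake masked positions, and the nn.Linear vocabulary projection are all assumptions for illustration, not code from this repo):

import torch
import torch.nn as nn

batch, seq, hidden_dim, vocab_size = 8, 16, 32, 100   # toy sizes

hidden = torch.randn(batch, seq, hidden_dim)    # transformer block output: (batch, seq, hidden)
to_vocab = nn.Linear(hidden_dim, vocab_size)    # projection to the vocabulary

# pretend every sentence has 2 masked positions; in practice these indices
# would be recorded during preprocessing
masked_i = torch.randint(0, seq, (batch, 2))                                   # (batch, n_masked)
flat_index = (torch.arange(batch).unsqueeze(1) * seq + masked_i).reshape(-1)   # j * seq + i

# step 1: reshape to (batch*seq, hidden); step 2: fancy-index only the masked rows
masked_hidden = hidden.reshape(batch * seq, hidden_dim)[flat_index]   # (total_masked, hidden)
masked_scores = to_vocab(masked_hidden)                               # (total_masked, vocab_size)

This way the big matrix scales with the number of masked tokens (about 15% of batch*seq) instead of the full batch_size * seq_len * vocab_size.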
Why is output_label 0 here, and not "tokens[i]"? If we set output_label = 0, does that mean only 15% of the training data is used to train the masked LM?
tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
output_label.append(0)
@leon-cas yes, your question is answered in #36.