BERT-pytorch

When training the masked LM, are the unmasked words (which have label 0) trained together with the masked words?

Open • coddinglxf opened this issue 7 years ago • 6 comments

According to the code:

    def random_word(self, sentence):
        tokens = sentence.split()
        output_label = []

        for i, token in enumerate(tokens):
            prob = random.random()
            if prob < 0.15:
                prob /= 0.15

                # 80%: replace the token with the mask token
                if prob < 0.8:
                    tokens[i] = self.vocab.mask_index

                # 10%: replace the token with a random token
                elif prob < 0.9:
                    tokens[i] = random.randrange(len(self.vocab))

                # 10%: keep the current token
                else:
                    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

                # masked position: the label is the original token's id
                output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

            else:
                # unmasked position: keep the token's id and give it label 0 (= pad_index)
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
                output_label.append(0)

        return tokens, output_label
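
To make the question concrete, here is a small self-contained sketch (a toy stand-in for the vocab, with the 80/10/10 replacement collapsed to plain masking; the names are illustrative, not the repo's). Roughly 85% of positions end up with label 0:

    import random

    # Toy stand-in for the repo's vocab class (illustrative only, not the real Vocab).
    class ToyVocab:
        def __init__(self, words):
            self.itos = ["<pad>", "<unk>", "<mask>"] + sorted(set(words))
            self.stoi = {w: i for i, w in enumerate(self.itos)}
            self.pad_index, self.unk_index, self.mask_index = 0, 1, 2

        def __len__(self):
            return len(self.itos)

    def random_word_demo(sentence, vocab):
        # Simplified standalone version of the method above: always mask the chosen 15%.
        tokens = sentence.split()
        output_label = []
        for i, token in enumerate(tokens):
            if random.random() < 0.15:
                output_label.append(vocab.stoi.get(token, vocab.unk_index))
                tokens[i] = vocab.mask_index
            else:
                tokens[i] = vocab.stoi.get(token, vocab.unk_index)
                output_label.append(0)  # unmasked position: label 0
        return tokens, output_label

    sentence = "the quick brown fox jumps over the lazy dog"
    tokens, labels = random_word_demo(sentence, ToyVocab(sentence.split()))
    print(labels)  # mostly zeros: only ~15% of positions carry a real target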

Do we need to exclude the unmasked words when training the LM?

coddinglxf • Oct 23 '18 07:10

@coddinglxf I just solved that problem with nn.NLLLoss(ignore_index=0), where 0 equals pad_index. Even if we target 0 (the unmasked value), it doesn't affect the loss or the backpropagation.
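
A minimal sketch of that trick (shapes, vocab size, and variable names here are illustrative assumptions, not from the repo):

    import torch
    import torch.nn as nn

    batch_size, seq_len, vocab_size = 2, 8, 100  # illustrative sizes

    # log-probabilities from the LM head: (batch, seq, vocab)
    log_probs = torch.randn(batch_size, seq_len, vocab_size).log_softmax(dim=-1)

    # labels: 0 for padding and unmasked positions, the original token id for masked ones
    target = torch.zeros(batch_size, seq_len, dtype=torch.long)
    target[0, 3] = 42
    target[1, 5] = 7

    # ignore_index=0 drops every position whose label is 0 from the loss,
    # so only the masked positions contribute gradients.
    criterion = nn.NLLLoss(ignore_index=0)
    loss = criterion(log_probs.view(-1, vocab_size), target.view(-1))
    print(loss)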

codertimo • Oct 23 '18 07:10

Got it, thanks. However, when the batch size is large, the final LM output will be "batch_size * seq_len * vocab_size", and such a matrix is too big. Maybe we can record the indices of the masked words when preprocessing the text and use indexing to save memory.
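
For a sense of scale (illustrative numbers, not from the thread), the full output already takes a few gigabytes in float32:

    # batch_size * seq_len * vocab_size float32 logits
    batch_size, seq_len, vocab_size = 64, 512, 30000
    print(batch_size * seq_len * vocab_size * 4 / 1e9)  # ~3.9 GB, before gradients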

coddinglxf • Oct 23 '18 07:10

@coddinglxf That's what I thought at first, but I couldn't implement it efficiently enough to be worth the GPU computation time. If you have any idea, please implement it and send a pull request :) It would be really cool to do it 👍

codertimo • Oct 23 '18 07:10

@codertimo The output of the transformer block will be (batch, seq, hidden).

Step 1: reshape the transformer output to (batch*seq, hidden).

Step 2: index the masked words. For the 0-th sentence, the i-th masked word has index 0 * seq + i; for the j-th sentence, the i-th masked word has index j * seq + i.

Combining step 1 and step 2, we can get the hidden output of the masked words only (using fancy indexing).

The problem is that we cannot do this easily with multiple GPUs.
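
A rough PyTorch sketch of this gather-by-index idea (tensor names and sizes here are assumptions for illustration):

    import torch

    batch_size, seq_len, hidden, vocab_size = 2, 8, 16, 100  # illustrative sizes

    transformer_out = torch.randn(batch_size, seq_len, hidden)   # (batch, seq, hidden)
    labels = torch.zeros(batch_size, seq_len, dtype=torch.long)  # 0 = unmasked
    labels[0, 3], labels[1, 5] = 42, 7                           # masked token ids

    # step 1: reshape to (batch*seq, hidden)
    flat = transformer_out.view(-1, hidden)

    # step 2: position i of sentence j has flat index j * seq_len + i;
    # keep only the indices whose label is non-zero (the masked positions)
    masked_idx = torch.nonzero(labels.view(-1), as_tuple=True)[0]

    # fancy indexing: hidden states of the masked words only
    masked_hidden = flat[masked_idx]                  # (n_masked, hidden)

    # the LM head now produces (n_masked, vocab) logits instead of (batch*seq, vocab)
    lm_head = torch.nn.Linear(hidden, vocab_size)
    loss = torch.nn.functional.cross_entropy(lm_head(masked_hidden),
                                             labels.view(-1)[masked_idx])
    print(loss)

Only the masked rows ever reach the vocabulary-sized projection, so the full (batch*seq, vocab) matrix is never materialized.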

coddinglxf • Oct 23 '18 08:10

> @coddinglxf I just solved that problem with nn.NLLLoss(ignore_index=0), where 0 equals pad_index. Even if we target 0 (the unmasked value), it doesn't affect the loss or the backpropagation.

Why is output_label 0 here, and not tokens[i]? If we set output_label to 0, does that mean only 15% of the training data is used to train the masked LM?

    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
    output_label.append(0)

leon-cas • Oct 26 '18 08:10

@leon-cas Yes, your question is addressed in #36.

codertimo • Oct 30 '18 07:10