Mistake in Chapter 7, Fine-tuning a masked language model, whole_word_masking_data_collator function?

Open Priya22 opened this issue 1 year ago • 0 comments

Hello,

I've been following the excellent tutorial on fine-tuning a masked language model, applying it to my custom dataset. In the definition of the whole_word_masking_data_collator function, I notice that the new_labels list is built up but never assigned back to the feature["labels"] field. Should this assignment be added?

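import collections

import numpy as np
from transformers import default_data_collator

# tokenizer and wwm_probability are assumed to be defined earlier, as in the tutorial


def whole_word_masking_data_collator(features):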
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels  # proposed addition: write the masked labels back
    return default_data_collator(features)
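
To make the effect of that one line concrete, here is a minimal, dependency-free sketch of the same masking logic. It is not the tutorial's code: `mask_whole_words`, `MASK_TOKEN_ID`, the `write_back` flag and the toy token IDs are all hypothetical, and the random binomial draw is replaced by an explicit list of word indices so the result is reproducible.

```python
import collections

MASK_TOKEN_ID = 103  # hypothetical stand-in for tokenizer.mask_token_id
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_whole_words(feature, masked_words, write_back=True):
    """Mask the given word indices in a single feature (simplified sketch).

    `masked_words` stands in for the random draw in the original function.
    """
    feature = {key: list(value) for key, value in feature.items()}  # work on a copy
    word_ids = feature.pop("word_ids")

    # Map each word index to the token positions that belong to it
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is not None:
            if word_id != current_word:
                current_word = word_id
                current_word_index += 1
            mapping[current_word_index].append(idx)

    # Keep labels only at masked positions; everything else becomes IGNORE_INDEX
    labels = feature["labels"]
    new_labels = [IGNORE_INDEX] * len(labels)
    for word in masked_words:
        for idx in mapping[word]:
            new_labels[idx] = labels[idx]
            feature["input_ids"][idx] = MASK_TOKEN_ID

    if write_back:
        feature["labels"] = new_labels  # the assignment in question
    return feature


# Toy feature: special tokens at both ends, word 0 split into two tokens
example = {
    "input_ids": [101, 7, 8, 9, 102],
    "labels": [101, 7, 8, 9, 102],
    "word_ids": [None, 0, 0, 1, None],
}

with_fix = mask_whole_words(example, masked_words=[0], write_back=True)
without_fix = mask_whole_words(example, masked_words=[0], write_back=False)

print(with_fix["labels"])     # [-100, 7, 8, -100, -100]
print(without_fix["labels"])  # [101, 7, 8, 9, 102]
```

Without the write-back, the labels still hold every original token ID, so, assuming the labels start out as a copy of the input IDs, the loss would also be computed on positions that were never masked.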

Priya22 Aug 08 '22 08:08