Mistake in Chapter 7, Fine-tuning a masked language model, whole_word_masking_data_collator function?
Hello,
I've been following the excellent tutorial on fine-tuning a language model for my custom dataset.
In the definition of the whole_word_masking_data_collator function, I notice that the new_labels variable is created and filled in, but never assigned back to the feature["labels"] field. Should this be added?
    import collections

    import numpy as np
    from transformers import default_data_collator

    # wwm_probability and tokenizer are defined earlier in the tutorial


    def whole_word_masking_data_collator(features):
        for feature in features:
            word_ids = feature.pop("word_ids")

            # Create a map between words and corresponding token indices
            mapping = collections.defaultdict(list)
            current_word_index = -1
            current_word = None
            for idx, word_id in enumerate(word_ids):
                if word_id is not None:
                    if word_id != current_word:
                        current_word = word_id
                        current_word_index += 1
                    mapping[current_word_index].append(idx)

            # Randomly mask words
            mask = np.random.binomial(1, wwm_probability, (len(mapping),))
            input_ids = feature["input_ids"]
            labels = feature["labels"]
            new_labels = [-100] * len(labels)
            for word_id in np.where(mask)[0]:
                word_id = word_id.item()
                for idx in mapping[word_id]:
                    # Keep the true label only at masked positions
                    new_labels[idx] = labels[idx]
                    input_ids[idx] = tokenizer.mask_token_id
            feature["labels"] = new_labels  # proposed addition

        return default_data_collator(features)
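
To illustrate the concern, here is a minimal sketch of the per-feature loop body, run with and without the write-back. It is a simplified stand-in rather than the tutorial's code: the helper name mask_feature, the write_back flag, the masked_words argument (which replaces the random binomial draw so the result is reproducible), and the toy token ids are all assumptions made for this example.

```python
import collections
import copy


def mask_feature(feature, masked_words, mask_token_id, write_back):
    """Hypothetical, deterministic version of the collator's loop body."""
    # Map each word index to the token positions that belong to it
    mapping = collections.defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(feature["word_ids"]):
        if word_id is not None:
            if word_id != current_word:
                current_word = word_id
                current_word_index += 1
            mapping[current_word_index].append(idx)

    input_ids = feature["input_ids"]  # same list object, so edits happen in place
    labels = feature["labels"]
    new_labels = [-100] * len(labels)  # a new list, not visible through feature
    for word in masked_words:
        for idx in mapping[word]:
            new_labels[idx] = labels[idx]
            input_ids[idx] = mask_token_id

    if write_back:
        feature["labels"] = new_labels
    return feature


# Toy feature with arbitrary ids: special token, three words (the second
# split into two tokens), special token
toy = {
    "word_ids": [None, 0, 1, 1, 2, None],
    "input_ids": [101, 7, 8, 9, 10, 102],
    "labels": [101, 7, 8, 9, 10, 102],
}

without_fix = mask_feature(copy.deepcopy(toy), [1], 103, write_back=False)
with_fix = mask_feature(copy.deepcopy(toy), [1], 103, write_back=True)

print(without_fix["input_ids"])  # [101, 7, 103, 103, 10, 102]
print(without_fix["labels"])     # [101, 7, 8, 9, 10, 102]
print(with_fix["labels"])        # [-100, -100, 8, 9, -100, -100]
```

In this sketch the inputs are masked in both cases because input_ids is modified in place, but without the write-back the labels stay as they were, so nothing restricts the loss to the masked positions.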