Tokenization Course Issues
Hello,
I believe the corpus and the word_freqs output used in the BPE / WordPiece implementations are mismatched: course is not capitalized in corpus, but the word_freqs output seems to use the capitalized version, Course.
To reproduce
from collections import defaultdict

from transformers import AutoTokenizer

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Count how often each pre-tokenized word appears in the corpus.
word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    words = [word for word, _ in words_with_offsets]
    for word in words:
        word_freqs[word] += 1

# This is the word_freqs output shown in the course. The assertion fails because
# it contains 'Course', while the corpus above yields 'course'.
assert word_freqs == defaultdict(int, {'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4, 'chapter': 1, 'about': 1,
                                       'tokenization': 1, 'section': 1, 'shows': 1, 'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1,
                                       ',': 1, 'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1, 'how': 1, 'they': 1, 'are': 1,
                                       'trained': 1, 'and': 1, 'generate': 1, 'tokens': 1})
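The mismatch can also be pinned down without downloading a tokenizer. The sketch below is only an approximation: it assumes that, for this plain ASCII corpus, splitting on whitespace and punctuation with the regex `\w+|[^\w\s]` gives the same words as the bert-base-cased pre-tokenizer. The names `pre_tokenize` and `expected_keys` are hypothetical helpers for this illustration, not part of the course code.

```python
import re
from collections import defaultdict

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]


def pre_tokenize(text):
    # Assumption: a regex stand-in for the BERT pre-tokenizer.
    # It keeps runs of word characters and splits off each punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


word_freqs = defaultdict(int)
for text in corpus:
    for word in pre_tokenize(text):
        word_freqs[word] += 1

# Keys of the word_freqs output quoted in this issue.
expected_keys = {
    "This", "is", "the", "Hugging", "Face", "Course", ".", "chapter", "about",
    "tokenization", "section", "shows", "several", "tokenizer", "algorithms",
    "Hopefully", ",", "you", "will", "be", "able", "to", "understand", "how",
    "they", "are", "trained", "and", "generate", "tokens",
}

# Set differences isolate exactly which words disagree.
only_in_computed = set(word_freqs) - expected_keys
only_in_expected = expected_keys - set(word_freqs)

print(only_in_computed)  # {'course'}
print(only_in_expected)  # {'Course'}
```

The only disagreement is the capitalization of that single word, so changing either the corpus or the expected output should be enough to make the two agree.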
In the WordPiece section, if you go to the line where we train the tokenizer and print the learned vocab:
print(vocab)
The vocab from this print statement is missing the merge ab and has 69 merges, although vocab_size is set to 70.
The same Course -> course typo is also present in the Unigram section. The final tokenization assumes that the capitalized Course is used and results in ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.']. However, if the lowercase course is used, the tokenization would be ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁course.'].
Thanks for reporting these typos @KeremTurgutlu - you're totally right that the capitalization isn't applied consistently. I think the simplest change would be to capitalise Course in the corpus list - would you like to open a PR with the fixes?
@lewtun I created https://github.com/huggingface/course/pull/166