Tokenization Course Issues

Open KeremTurgutlu opened this issue 2 years ago • 4 comments

Hello,

I believe the corpus and the word_freqs output used in the BPE / WordPiece implementations are mismatched: "course" is not capitalized in the corpus, but the word_freqs output seems to use the capitalized "Course".

To reproduce

from collections import defaultdict

from transformers import AutoTokenizer

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

word_freqs = defaultdict(int)
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    words = [word for word, _ in words_with_offsets]
    for word in words:
        word_freqs[word] += 1

# This assert fails: the corpus contains 'course', but the word_freqs output shown in the course lists 'Course'.
assert word_freqs == defaultdict(int, {'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4, 'chapter': 1, 'about': 1,
    'tokenization': 1, 'section': 1, 'shows': 1, 'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1,
    ',': 1, 'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1, 'how': 1, 'they': 1, 'are': 1,
    'trained': 1, 'and': 1, 'generate': 1, 'tokens': 1})
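
To see exactly which keys differ without downloading a tokenizer, here is a minimal sketch. It assumes a plain regex split on words and punctuation as a stand-in for the bert-base-cased pre-tokenizer (I would expect the two to agree on this particular corpus, but that is an assumption), and the names pre_tokenize, expected, only_in_computed, only_in_expected and count_mismatches are made up for illustration.

```python
import re
from collections import Counter

corpus = [
    "This is the Hugging Face course.",
    "This chapter is about tokenization.",
    "This section shows several tokenizer algorithms.",
    "Hopefully, you will be able to understand how they are trained and generate tokens.",
]


def pre_tokenize(text):
    # Assumption: splitting into runs of word characters and single
    # punctuation marks approximates the BERT pre-tokenizer on this corpus.
    return re.findall(r"\w+|[^\w\s]", text)


# Word frequencies computed from the corpus as written (lowercase "course").
word_freqs = Counter(word for text in corpus for word in pre_tokenize(text))

# Word frequencies as listed in the assert above (capitalized "Course").
expected = {
    'This': 3, 'is': 2, 'the': 1, 'Hugging': 1, 'Face': 1, 'Course': 1, '.': 4,
    'chapter': 1, 'about': 1, 'tokenization': 1, 'section': 1, 'shows': 1,
    'several': 1, 'tokenizer': 1, 'algorithms': 1, 'Hopefully': 1, ',': 1,
    'you': 1, 'will': 1, 'be': 1, 'able': 1, 'to': 1, 'understand': 1,
    'how': 1, 'they': 1, 'are': 1, 'trained': 1, 'and': 1, 'generate': 1,
    'tokens': 1,
}

# Keys present on one side only, plus shared keys whose counts disagree.
only_in_computed = sorted(set(word_freqs) - set(expected))
only_in_expected = sorted(set(expected) - set(word_freqs))
count_mismatches = {
    word: (word_freqs[word], expected[word])
    for word in set(word_freqs) & set(expected)
    if word_freqs[word] != expected[word]
}

print("only in computed:", only_in_computed)
print("only in expected:", only_in_expected)
print("count mismatches:", count_mismatches)
```

Under that assumption, the only difference reported is the capitalization of that one word; every other key and count matches.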

KeremTurgutlu • Apr 14 '22 19:04

In WordPiece, if you go to the line where we train the tokenizer and print the learned vocab:

print(vocab)

The vocab from this print statement is missing the merge ab and has 69 merges, although vocab_size is set to 70.

KeremTurgutlu • Apr 15 '22 03:04

The same typo (Course -> course) is also present in Unigram. The final tokenization assumes the capitalized Course is used and results in ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁', 'c', 'ou', 'r', 's', 'e', '.']. However, if the lowercase course is used, the tokenization would be ['▁This', '▁is', '▁the', '▁Hugging', '▁Face', '▁course.'].

KeremTurgutlu • Apr 15 '22 17:04

Thanks for reporting these typos @KeremTurgutlu - you're totally right that the capitalization isn't applied consistently. I think the simplest change would be to capitalise Course in the corpus list - would you like to open a PR with the fixes?

lewtun • Apr 20 '22 13:04

@lewtun created https://github.com/huggingface/course/pull/166

KeremTurgutlu • May 07 '22 04:05