Imprecise description about removing token "pu" in section Unigram tokenization

Open yaojingguo opened this issue 4 months ago • 0 comments

https://huggingface.co/learn/nlp-course/chapter6/7?fw=pt says:

In this (very) particular case, we had two equivalent tokenizations of all the words: as we saw earlier, for example, "pug" could be tokenized ["p", "ug"] with the same score. Thus, removing the "pu" token from the vocabulary will give the exact same loss.

However, the following list from that page shows that "pun" is tokenized as ["pu", "n"]. If the "pu" token is removed, the score for "pun" could change. The loss stays the same only if "pun" keeps the same score after "pu" is removed.

"hug": ["hug"] (score 0.071428)
"pug": ["pu", "g"] (score 0.007710)
"pun": ["pu", "n"] (score 0.006168)
"bun": ["bu", "n"] (score 0.001451)
"hugs": ["hug", "s"] (score 0.001701)
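
One way to check this condition is to recompute the best score of each word with and without "pu". The sketch below is not the course's code: `best_score` is a hypothetical helper, the token frequencies are assumed to be the ones given on the linked course page (they reproduce the scores listed above), and the total of 210 is assumed to stay fixed after the removal, i.e. the probabilities are not renormalized.

```python
from math import isclose

# Assumed token frequencies (taken from the linked course page, not from this issue).
freqs = {
    "h": 15, "u": 36, "g": 20, "hu": 15, "ug": 20, "p": 17, "pu": 17, "n": 16,
    "un": 16, "b": 4, "bu": 4, "s": 5, "hug": 15, "gs": 5, "ugs": 5,
}
TOTAL = sum(freqs.values())  # 210; assumed to stay fixed when a token is removed


def best_score(word, vocab):
    """Return the highest unigram score over all segmentations of `word`.

    best[i] holds the best score for the prefix word[:i]; 0.0 means the
    prefix cannot be segmented with the given vocabulary.
    """
    best = [0.0] * (len(word) + 1)
    best[0] = 1.0
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab:
                candidate = best[start] * freqs[piece] / TOTAL
                best[end] = max(best[end], candidate)
    return best[-1]


words = ["hug", "pug", "pun", "bun", "hugs"]
full_vocab = set(freqs)
vocab_without_pu = full_vocab - {"pu"}

for word in words:
    before = best_score(word, full_vocab)
    after = best_score(word, vocab_without_pu)
    print(f"{word}: {before:.6f} -> {after:.6f} (unchanged: {isclose(before, after)})")
```

Under these assumptions, "pun" can still be tokenized as ["p", "un"] with the same score as ["pu", "n"], so the loss would not change in this toy example. The course text only gives "pug" as an example, though, which is why the statement reads as imprecise.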

yaojingguo avatar Apr 21 '24 15:04 yaojingguo