Unknown token missing from BPE tokenizer trained with BpeTrainer
I took a look at the short tokenizer summary tutorial (https://huggingface.co/transformers/tokenizer_summary.html) and tried to replicate the BPE part of it.
My initial corpus looks like this:
hug
hug
hug
hug
hug
hug
hug
hug
hug
hug
pug
pug
pug
pug
pug
pun
pun
pun
pun
pun
pun
pun
pun
pun
pun
pun
pun
bun
bun
bun
bun
hugs
hugs
hugs
hugs
hugs
which is the same as ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5), as presented in the tutorial.
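As a side note, here is a minimal sketch (not part of the original report) of how such a test.txt file can be written from those counts, one word per line:

# Write the toy corpus to test.txt, repeating each word according to its count.
counts = [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]
with open("test.txt", "w") as f:
    for word, n in counts:
        f.write((word + "\n") * n)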
Then I train the BPE tokenizer as follows:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(vocab_size=10)
tokenizer.train(trainer, ["test.txt"])
and test it on the words given in the example, "bug" and "mug":
In [12]: tokenizer.encode("bug").tokens
Out[12]: ['b', 'ug']
In [13]: tokenizer.encode("mug").tokens
Out[13]: ['ug']
Shouldn't it be ['UNK', 'ug']?
Another example:
In [17]: tokenizer.encode("mug 😀").tokens
Out[17]: ['ug']
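For reference, the learned vocabulary explains the difference between the two outputs: 'b' occurs in the corpus (in "bun") and ends up in the vocabulary, while 'm' never occurs and is silently dropped because no unknown token was defined. A quick check using the trained tokenizer from above:

# Inspect the symbols the BPE model learned from the toy corpus.
vocab = tokenizer.get_vocab()
print(sorted(vocab, key=vocab.get))
# Expected to be something like ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug'];
# 'm' is absent, so encode("mug") has nothing to map it to.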
Is this a bug?
Seems 100% like a bug to me.
The behaviour should be either to raise an exception (because no unk token was defined) or to return [UNK], as you expect.
I opened a PR that:
- would make your particular example fail (no unk_token is defined, and we don't define one by default) by raising:
  Unk token was not defined but should be used to encode this string
- which you can fix by doing:
  trainer = BpeTrainer(vocab_size=10, unk_token="[UNK]")
@n1t0, what's your opinion on this?
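For completeness, a minimal sketch of the same fix as it could look against a recent release of tokenizers (assumptions: the unk_token trainer argument above comes from the PR mentioned; in current releases the unk token is instead declared on the BPE model and registered via the trainer's special_tokens, and train() takes the file list first):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Declare the unk token on the model so it is used at encoding time,
# and register it as a special token so the trainer reserves an id for it.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# One extra vocab slot for [UNK], assuming special tokens count toward vocab_size.
trainer = BpeTrainer(vocab_size=11, special_tokens=["[UNK]"])

# Recent versions take the file list first; older ones used train(trainer, files)
# as in the snippet at the top of this issue.
tokenizer.train(["test.txt"], trainer)

print(tokenizer.encode("mug").tokens)  # expected: ['[UNK]', 'ug']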