Unknown token missing from BPE tokenizer trained with BpeTrainer
I took a look at the short tokenizer summary tutorial (https://huggingface.co/transformers/tokenizer_summary.html) and tried to replicate the BPE part of it.
My initial corpus looks like this:
hug
hug
hug
hug
hug
hug
hug
hug
hug
hug
pug
pug
pug
pug
pug
pun
pun
pun
pun
pun
pun
pun
pun
pun
pun
pun
pun
bun
bun
bun
bun
hugs
hugs
hugs
hugs
hugs
which is the same as ('hug', 10), ('pug', 5), ('pun', 12), ('bun', 4), ('hugs', 5), as presented in the tutorial.
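As a side note, here is a minimal sketch (not part of the original report) of how such a test.txt file can be written from those counts, one word per line:

# Write the toy corpus to test.txt, repeating each word according to its count.
counts = [("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)]
with open("test.txt", "w") as f:
    for word, n in counts:
        f.write((word + "\n") * n)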
Then I train the BPE tokenizer as follows:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(vocab_size=10)
tokenizer.train(trainer, ["test.txt"])
and test it on the words given in the example, "bug" and "mug":
In [12]: tokenizer.encode("bug").tokens
Out[12]: ['b', 'ug']
In [13]: tokenizer.encode("mug").tokens
Out[13]: ['ug']
Shouldn't it be ['UNK', 'ug']?
Another example:
In [17]: tokenizer.encode("mug 😀").tokens
Out[17]: ['ug']
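For reference, the learned vocabulary explains the difference between the two outputs: 'b' occurs in the corpus (in "bun") and ends up in the vocabulary, while 'm' never occurs and is silently dropped because no unknown token was defined. A quick check using the trained tokenizer from above:

# Inspect the symbols the BPE model learned from the toy corpus.
vocab = tokenizer.get_vocab()
print(sorted(vocab, key=vocab.get))
# Expected to be something like ['b', 'g', 'h', 'n', 'p', 's', 'u', 'ug', 'un', 'hug'];
# 'm' is absent, so encode("mug") has nothing to map it to.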
Is this a bug?
Seems 100% like a bug to me.
The behaviour should be either to raise an exception (because no unk token was defined) or to return [UNK], as you expect.
I opened a PR that:
- would make your particular example fail (no unk_token is defined, and we don't define one by default) by raising:
  Unk token was not defined but should be used to encode this string
- which you can fix by doing:
  trainer = BpeTrainer(vocab_size=10, unk_token="[UNK]")
@n1t0, what's your opinion on this?
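For completeness, a minimal sketch of the same fix as it could look against a recent release of tokenizers (assumptions: the unk_token trainer argument above comes from the PR mentioned; in current releases the unk token is instead declared on the BPE model and registered via the trainer's special_tokens, and train() takes the file list first):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Declare the unk token on the model so it is used at encoding time,
# and register it as a special token so the trainer reserves an id for it.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# One extra vocab slot for [UNK], assuming special tokens count toward vocab_size.
trainer = BpeTrainer(vocab_size=11, special_tokens=["[UNK]"])

# Recent versions take the file list first; older ones used train(trainer, files)
# as in the snippet at the top of this issue.
tokenizer.train(["test.txt"], trainer)

print(tokenizer.encode("mug").tokens)  # expected: ['[UNK]', 'ug']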