
Add UNK token to a Unigram tokenizer created from a given vocabulary

Open · marcmk6 opened this issue 2 years ago

Hi,

I want to use a Unigram tokenizer, instantiated from a vocabulary obtained elsewhere (e.g. google/sentencepiece). But I couldn't figure out a way to add an unk token to the tokenizer.

from tokenizers import Tokenizer
from tokenizers.models import Unigram

vocab = [
    ('<unk>', 0), ('<sos>', 0), ('<eos>', 0),
    ('S', -2.37051), ('T', -3.22602), ('A', -3.52795), ('R', -3.54177),
    ('N', -3.57724), ('D', -3.64957), ('V', -3.75516), ('G', -3.87228),
    ('I', -3.98412), ('L', -4.02885), ('P', -4.06709), ('H', -4.21636),
    ('E', -4.26619), ('K', -4.29312), ('F', -4.55107), ('Q', -4.58542),
    ('M', -4.73073), ('Y', -4.83201), ('C', -5.02805),
]

tokenizer = Tokenizer(Unigram(vocab=vocab))

seq = 'ADV'
_ = tokenizer.encode(seq) # works fine

oov_seq = '@DV'
_ = tokenizer.encode(oov_seq) # Exception: Encountered an unknown token but `unk_id` is missing

Is there any method to add an unk token to the tokenizer?

Thanks

marcmk6 commented on May 23, 2022

Hi @Narsil,

Sorry to bother, but could you help?

marcmk6 commented on Jun 9, 2022

@marcmk6

Sorry, I didn't see this the first time.

Try doing Unigram(vocab=vocab, unk_id=0) (it's either unk or unk_id, I don't remember).

Basically, the vocab itself doesn't say which entry is your unk token, so Unigram needs to be told explicitly. And since an unk token isn't always needed (for byte-level models, for instance), it isn't set when not supplied, which means the model doesn't know what id to fall back on when it hits an unknown token.

Is that clearer?
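
Something like the following, reusing the vocab list from your snippet. To be clear about the assumptions here: the keyword is unk_id (not unk), and unk_id=0 works only because '<unk>' sits at index 0 of your vocab.

from tokenizers import Tokenizer
from tokenizers.models import Unigram

# unk_id=0 points at the '<unk>' entry, which is index 0 in the vocab above
tokenizer = Tokenizer(Unigram(vocab=vocab, unk_id=0))

oov_seq = '@DV'
encoding = tokenizer.encode(oov_seq)
print(encoding.tokens)  # '@' now maps to '<unk>' instead of raising, e.g. ['<unk>', 'D', 'V']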

Narsil commented on Jun 9, 2022

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented on Feb 16, 2024