Add UNK token to a Unigram tokenizer created from a given vocabulary
Hi,
I want to use a Unigram tokenizer instantiated from a vocabulary obtained elsewhere (e.g. from google/sentencepiece), but I couldn't figure out a way to add an unk token to the tokenizer.
from tokenizers import Tokenizer
from tokenizers.models import Unigram
vocab = [
    ('<unk>', 0), ('<sos>', 0), ('<eos>', 0),
    ('S', -2.37051), ('T', -3.22602), ('A', -3.52795), ('R', -3.54177),
    ('N', -3.57724), ('D', -3.64957), ('V', -3.75516), ('G', -3.87228),
    ('I', -3.98412), ('L', -4.02885), ('P', -4.06709), ('H', -4.21636),
    ('E', -4.26619), ('K', -4.29312), ('F', -4.55107), ('Q', -4.58542),
    ('M', -4.73073), ('Y', -4.83201), ('C', -5.02805),
]
tokenizer = Tokenizer(Unigram(vocab=vocab))
seq = 'ADV'
_ = tokenizer.encode(seq) # works fine
oov_seq = '@DV'
_ = tokenizer.encode(oov_seq) # Exception: Encountered an unknown token but `unk_id` is missing
Is there any method to add an unk token to the tokenizer?
Thanks
Hi @Narsil,
Sorry to bother, but could you help?
@marcmk6
Sorry, I didn't see this the first time.
Try doing Unigram(vocab=vocab, unk_id=0) (it's either unk or unk_id, I don't remember).
Basically, the vocab has no idea which token is your unk, so Unigram needs to be told explicitly. And since an unk token is not always needed (for byte-level models, for instance), it isn't set when not supplied, which means the model doesn't know how to fill in the ids for unknown tokens.
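For what it's worth, here is a minimal sketch of the fix. The parameter is most likely unk_id (the error message above mentions `unk_id`); the trimmed vocab and the exact ids in the comments are illustrative assumptions:
from tokenizers import Tokenizer
from tokenizers.models import Unigram

# Same vocab as above, trimmed for brevity; '<unk>' sits at index 0.
vocab = [('<unk>', 0.0), ('<sos>', 0.0), ('<eos>', 0.0),
         ('A', -3.52795), ('D', -3.64957), ('V', -3.75516)]

# Tell Unigram which vocab index holds the unk token.
tokenizer = Tokenizer(Unigram(vocab=vocab, unk_id=0))

print(tokenizer.encode('ADV').ids)  # in-vocab input still encodes fine
print(tokenizer.encode('@DV').ids)  # '@' should now map to id 0 ('<unk>') instead of raising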
Is that clearer?