`WordPieceTrainer.train_from_iterator` is not deterministic
I am not sure if this is expected behavior, but a tokenizer trained on the same data sometimes encodes the same input differently.
Reproducer:
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
res = []
# Train a fresh tokenizer on the same single sentence 100 times
# and record how each one encodes the same test string.
for _ in range(100):
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordPieceTrainer(special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(['Wel come to the 🤗 Tok en izers libr ary.'], trainer)
    output = tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
    res.append(tuple(output.tokens))
# Across the 100 runs, two distinct tokenizations appear.
res = list(set(res))
assert len(res) == 2
print(res[0])
# ('[UNK]', 'to', 'the', '🤗', 'Tok', '##e', '##n', '##i', '##z', '##er', '##s', '[UNK]', '.')
print(res[1])
# ('[UNK]', 'to', 'the', '🤗', 'Tok', '##e', '##n', '##i', '##z', '##ers', '[UNK]', '.')
Hey! I think it is expected:
- you have dropout
- you are not setting a seed.
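As far as I know, the dropout in question would be BPE-dropout, which is exposed as a constructor argument of the `BPE` model rather than of `WordPiece` or the trainers. A minimal sketch, assuming the current `tokenizers` API:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# dropout is a parameter of the BPE model itself (BPE-dropout):
# with dropout=0.1, each merge is skipped with 10% probability at encode time,
# so the same text can tokenize differently between calls.
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(['Wel come to the 🤗 Tok en izers libr ary.'], trainer)
print(tokenizer.encode("Welcome to the 🤗 Tokenizers library.").tokens)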
I couldn't find any mention of dropout in the documentation of either the WordPiece model or the WordPieceTrainer.
Also, I couldn't instantiate either of those classes with a dropout parameter; I always get the warning `Ignored unknown kwargs option dropout`.
What is the proper way to specify a dropout for the WordPiece tokenizer?