
`WordPieceTrainer.train_from_iterator` is not deterministic

Tialo opened this issue 7 months ago • 2 comments

I am not sure if this is expected behavior, but a tokenizer trained on the same data sometimes encodes the same input differently.

Reproducer:

from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

res = []
for _ in range(100):
    # Train a fresh WordPiece tokenizer on the same single sentence every run.
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = WordPieceTrainer(special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(['Wel come to the 🤗 Tok en izers libr ary.'], trainer)
    output = tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
    res.append(tuple(output.tokens))

# Identical training data should give identical encodings, yet two distinct
# tokenizations appear across the 100 runs.
res = list(set(res))
assert len(res) == 2
print(res[0])
# ('[UNK]', 'to', 'the', '🤗', 'Tok', '##e', '##n', '##i', '##z', '##er', '##s', '[UNK]', '.')
print(res[1])
# ('[UNK]', 'to', 'the', '🤗', 'Tok', '##e', '##n', '##i', '##z', '##ers', '[UNK]', '.')

Tialo · Jun 07 '25 16:06

Hey! I think it is expected:

  1. you have dropout (see the sketch below)
  2. you are not setting a seed.
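
For context on the dropout point: in tokenizers, dropout is exposed on the BPE model (BPE-dropout, which randomly skips merges at encoding time), while the WordPiece docs do not list it. A minimal sketch of where the option actually lives, assuming a recent tokenizers release:

from tokenizers import Tokenizer
from tokenizers.models import BPE

# BPE accepts a dropout probability in (0, 1]; during encoding each merge
# is skipped with that probability (BPE-dropout regularization).
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))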

ArthurZucker · Jul 29 '25 13:07

I couldn't find any mention of dropout in the documentation of either the WordPiece model or the WordPieceTrainer. I also couldn't instantiate either of those classes with a dropout parameter; I always get the warning Ignored unknown kwargs option dropout.
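
A minimal sketch of that attempt (the warning text in the comment is quoted from the runs described above):

from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

# Both calls succeed, but the kwarg is silently dropped with the warning:
#   Ignored unknown kwargs option dropout
model = WordPiece(unk_token="[UNK]", dropout=0.1)
trainer = WordPieceTrainer(special_tokens=["[UNK]"], dropout=0.1)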

What is the proper way to specify dropout for the WordPiece tokenizer?

Tialo · Jul 30 '25 14:07