Pretrained BertWordPieceTokenizer loads with different parameters
I'm training a WordPiece tokenizer and saving it via save_model(), but when loaded back from the save directory, the tokenizer doesn't behave the same way it did before saving.
Training
from tokenizers import BertWordPieceTokenizer
train_files = ["plain_text.txt"]
# initialize
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False
)
# and train
tokenizer.train(files=train_files, vocab_size=8000, min_frequency=2,
                limit_alphabet=1000, wordpieces_prefix='##',
                special_tokens=["[UNK]", "[PAD]", "[SEP]", "[MASK]", "[CLS]"])
tokenizer.save_model("./tokenizer-trained2")
--> works as expected (no lowercasing, no accent stripping):
# CHECK TOKENIZATION
sample = "Одӥг гужем ӝытэ со чорыганы ӝутказ Кам дуре. Татарстанысь Голюшурмае вуим но, бордамы эшшо олокӧня"
result = tokenizer.encode(sample, add_special_tokens=True)
print(result.tokens)
print(tokenizer.decode(result.ids))
>>> ['Одӥг', 'гужем', 'ӝыт', '##э', 'со', 'чорыг', '##аны', 'ӝут', '##каз', 'Кам', 'дуре', '.', 'Татарстанысь', 'Г', '##олю', '##шур', '##ма', '##е', 'вуи', '##м', 'но', ',', 'борд', '##амы', 'эшшо', 'олок', '##ӧня']
>>> Одӥг гужем ӝытэ со чорыганы ӝутказ Кам дуре. Татарстанысь Голюшурмае вуим но, бордамы эшшо олокӧня
Loading pretrained
from transformers import BertTokenizer
tokenizer2 = BertTokenizer.from_pretrained("./tokenizer-trained2")
--> lowercases and strips accents
# CHECK TOKENIZATION
sample = "Одӥг гужем ӝытэ со чорыганы ӝутказ Кам дуре. Татарстанысь Голюшурмае вуим но, бордамы эшшо олокӧня"
result = tokenizer2([sample], padding=False, add_special_tokens=False)
print(tokenizer2.convert_ids_to_tokens(result["input_ids"][0]))
print(tokenizer2.decode(result["input_ids"][0], clean_up_tokenization_spaces=True))
>>> ['одиг', 'гужем', 'ж', '##ытэ', 'со', 'чорыг', '##аны', 'ж', '##ут', '##каз', 'кам', 'дуре', '.', 'тат', '##арст', '##анысь', 'г', '##олю', '##шур', '##ма', '##е', 'вуи', '##м', 'но', ',', 'борд', '##амы', 'эшшо', 'олок', '##он', '##я']
>>> одиг гужем жытэ со чорыганы жутказ кам дуре. татарстанысь голюшурмае вуим но, бордамы эшшо олоконя
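For reference, the effective settings can also be inspected directly on the loaded tokenizer. This is a diagnostic sketch; the attribute names follow the slow BertTokenizer / BasicTokenizer implementation in transformers and were not part of my original check:
# Diagnostic sketch: inspect what the loaded tokenizer actually ended up with.
print(tokenizer2.do_lower_case)                            # lowercase flag actually in effect
print(tokenizer2.basic_tokenizer.strip_accents)            # accent handling of the underlying BasicTokenizer
print(tokenizer2.basic_tokenizer.tokenize_chinese_chars)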
But the vocabularies do match:
tokenizer.get_vocab() == tokenizer2.get_vocab()
>>> True
Possible source of the problem
In the tokenizer_config.json file, which appears after saving, the parameters do not match those of the saved tokenizer:
{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": false, "clean_text": true, "tokenizer_class": "BertTokenizer"}
If I manually edit the parameters, the tokenizer loads as expected.
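Instead of editing the file by hand, the same overrides can also be passed directly to from_pretrained, since keyword arguments take precedence over tokenizer_config.json. A workaround sketch, not verified on this exact setup:
# Workaround sketch (not verified here): kwargs passed to from_pretrained
# override the values stored in tokenizer_config.json.
from transformers import BertTokenizer
tokenizer2 = BertTokenizer.from_pretrained(
    "./tokenizer-trained2",
    do_lower_case=False,
    strip_accents=False,
    tokenize_chinese_chars=False,
)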
I've got transformers==4.18.0 and tokenizers==0.12.1.
Not sure whether it is a bug or my misunderstanding. Thanks in advance for any help.
Hi, I have the same problem with the Udmurt language. Could you let me know if you managed to solve it?
Hi @ulyanaisaeva @codemurt ,
This is one of the more shady parts of this library, unfortunately. tokenizer.train actually uses another object called the trainer, which might not see some of the tokenizer parameters, meaning it will not use lowercase=False as it should and instead falls back on its own default of lowercase=True.
Do you mind checking if
from tokenizers import BertWordPieceTokenizer
train_files = ["plain_text.txt"]
# initialize
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False
)
# and train
tokenizer.train(files=train_files, vocab_size=8000, min_frequency=2,
                limit_alphabet=1000, wordpieces_prefix='##',
                lowercase=False,  # or maybe do_lowercase=False, I don't remember the exact name <--------------------
                special_tokens=["[UNK]", "[PAD]", "[SEP]", "[MASK]", "[CLS]"])
tokenizer.save_model("./tokenizer-trained2")
and see if it works better?
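If the flag on train does not do it, another option is to skip the BertWordPieceTokenizer wrapper and wire up the normalizer and the trainer explicitly with the lower-level classes, so the lowercasing behaviour is never hidden. This is a sketch with the public tokenizers API, untested on your data:
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers, decoders

# Sketch (untested here): the normalizer is set explicitly on the Tokenizer,
# and the trainer only controls vocabulary construction, so there is no
# hidden lowercase default to fight with.
tok = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tok.normalizer = normalizers.BertNormalizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False,
)
tok.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
tok.decoder = decoders.WordPiece(prefix="##")
# note: a BERT-style post-processor for [CLS]/[SEP] would still need to be
# added before using this for model inputs.

trainer = trainers.WordPieceTrainer(
    vocab_size=8000,
    min_frequency=2,
    limit_alphabet=1000,
    continuing_subword_prefix="##",
    special_tokens=["[UNK]", "[PAD]", "[SEP]", "[MASK]", "[CLS]"],
)
tok.train(["plain_text.txt"], trainer)
tok.save("./tokenizer-trained2/tokenizer.json")  # directory assumed to exist; saves the whole pipeline
Saving this way writes the normalizer settings into tokenizer.json itself, which can then be loaded in transformers with PreTrainedTokenizerFast(tokenizer_file=...) instead of relying on tokenizer_config.json.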