BPE Trainer doesn't respect the `vocab_size` parameter when dataset size is increased
I'm training a new tokenizer on an Indic language, Tamil. I tried two different runs:
Test run with part of the data used for training (~0.3 GB):
```python
from datasets import load_dataset
from tokenizers import Tokenizer, trainers, models, pre_tokenizers

# Load a single shard (~0.3 GB) and keep only the "text" column.
ta_data = load_dataset("ai4bharat/sangraha", cache_dir='./datasets', data_files='verified/tam/data-0.parquet', split='train')
ta_data = ta_data.remove_columns([
    col for col in ta_data.column_names if col != "text"
])

def batch_iterator(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.add_special_tokens(['[ta]'])

special_tokens = ["[STOP]", "[UNK]", "[SPACE]", "[ta]"]
trainer = trainers.BpeTrainer(vocab_size=2000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
tokenizer.save('./ta_vocab_pretok_2000.json')
```
This gives me a vocab file with exactly 2000 tokens, and the merges are computed correctly: ta_vocab_pretok_2000.json
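As a quick sanity check (a minimal sketch, reusing the file path from the snippet above), the saved tokenizer can be reloaded and its vocab size confirmed:

```python
# Verification sketch: reload the saved tokenizer and confirm the vocab size
# matches the vocab_size passed to the trainer.
from tokenizers import Tokenizer

tok = Tokenizer.from_file('./ta_vocab_pretok_2000.json')
print(tok.get_vocab_size())  # expected: 2000
```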
Run with the entire dataset used for training (~15 GB):
```python
from datasets import load_dataset
from tokenizers import Tokenizer, trainers, models, pre_tokenizers

# Same script as above; the only change is data_files, which now loads the full ~15 GB corpus.
ta_data = load_dataset("ai4bharat/sangraha", cache_dir='./datasets', data_files='verified/tam/*', split='train')
ta_data = ta_data.remove_columns([
    col for col in ta_data.column_names if col != "text"
])

def batch_iterator(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.add_special_tokens(['[ta]'])

special_tokens = ["[STOP]", "[UNK]", "[SPACE]", "[ta]"]
trainer = trainers.BpeTrainer(vocab_size=2000, special_tokens=special_tokens)
tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
tokenizer.save('./ta_vocab_pretok_2000.json')
```
This gives me a much larger vocab file with no merges at all. The vocab count is ~5800, which ignores the value of 2000 I passed to the trainer: ta_vocab_pretok_2000_full_data.json
Questions:
- Why does the trainer ignore the `vocab_size` parameter?
- Where are the non-Tamil tokens coming from? The emojis, Greek, Arabic, and other-language tokens? (See the diagnostic sketch after this list.)
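One way to see where those tokens come from is to inspect the saved vocab directly. A minimal sketch, assuming the attached file name from above: list the single-character entries that fall outside the Tamil Unicode block; these are alphabet characters collected from the raw corpus rather than merge products.

```python
# Diagnostic sketch: look at single-character vocab entries in the saved tokenizer.
from tokenizers import Tokenizer

tok = Tokenizer.from_file('./ta_vocab_pretok_2000_full_data.json')
vocab = tok.get_vocab()  # dict of token -> id

single_chars = [t for t in vocab if len(t) == 1]
# Tamil occupies U+0B80 - U+0BFF; anything else here came straight from the raw text.
non_tamil = [c for c in single_chars if not ('\u0b80' <= c <= '\u0bff')]

print(len(vocab), len(single_chars), len(non_tamil))
print(sorted(non_tamil)[:50])
```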
+1, I ran into the same problem. I tried `trainers.WordPieceTrainer` and `BertTokenizerFast.train_new_from_iterator` as well, with the same result: they also don't respect the `vocab_size` parameter.
That is because your `vocab_size` is too small. BPE only performs merges while the current vocab size is below the target, and every merge adds a new subword, so the vocab only grows from its starting point. That starting point is the alphabet: every distinct character seen in the training data, plus the special tokens. With the full corpus, the alphabet alone (including the emojis and the Greek, Arabic, and other-script characters you noticed) already exceeds 2000 entries, so training stops before any merges are computed and you end up with the ~5800-entry alphabet.
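If you want to keep `vocab_size=2000`, one possible workaround (not from this thread, just a hedged sketch) is to cap the alphabet with the trainer's `limit_alphabet` option so that rare characters fall back to `[UNK]` and budget remains for merges; alternatively, raise `vocab_size` above the alphabet size.

```python
# Sketch: cap the alphabet so BPE has room for merges within vocab_size=2000.
# limit_alphabet keeps only the N most frequent characters; rarer characters
# are dropped from the alphabet and end up handled via the unknown token.
from tokenizers import Tokenizer, trainers, models, pre_tokenizers

tokenizer = Tokenizer(models.BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

special_tokens = ["[STOP]", "[UNK]", "[SPACE]", "[ta]"]
trainer = trainers.BpeTrainer(
    vocab_size=2000,
    special_tokens=special_tokens,
    limit_alphabet=500,  # illustrative value; tune to your data
)

# Training then proceeds exactly as in the original snippets:
# tokenizer.train_from_iterator(batch_iterator(ta_data, 100), trainer=trainer, length=len(ta_data))
```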