
vocab_size issue with Whitespace pre_tokenizer

Open ctoraman opened this issue 3 years ago • 9 comments

Hi,

I am training a WordPiece tokenizer with a custom vocabulary size, but somehow the vocab size of the trained tokenizer ends up much higher than the size I requested (16700).

from tokenizers import Tokenizer, normalizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFC, Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.normalizer = normalizers.Sequence([NFC(), Lowercase()])
trainer = WordPieceTrainer(vocab_size=16700, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(batch_iterator(tokenizer_batch_size, dataset), trainer=trainer, length=len(dataset))

The trained tokenizer ends up with a much larger vocabulary: tokenizer vocab size: 28582

I found that the output becomes correct when I use the ByteLevel() pre_tokenizer instead of Whitespace().

I think there may be an issue with using Whitespace and the trainer together.

ctoraman avatar Jan 19 '22 09:01 ctoraman

This example seems to work correctly:

from tokenizers import Tokenizer, models, pre_tokenizers, normalizers, trainers
import datasets


def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]


dataset = datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.normalizer = normalizers.Sequence([normalizers.NFC(), normalizers.Lowercase()])
trainer = trainers.WordPieceTrainer(
    vocab_size=16700, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))

print(tokenizer.get_vocab_size())
#16700

Is there any way you could provide a reproducible script? Does your data use a large variety of Unicode code points? If so, it's possible that the resulting tokenizer's vocabulary consists largely of individual Unicode code points (which would explain why ByteLevel works, since it splits those into bytes). The tokenizer training needs an initial alphabet, which consists of every character seen in the data, and it doesn't even attempt to trim it, so that could be it.
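As a quick illustration of that hypothesis (a minimal sketch, assuming only the standard pre_tokenize_str API; the sample string is arbitrary): Whitespace leaves every distinct Unicode character in the stream, so each one lands in the initial alphabet, while ByteLevel remaps everything onto at most 256 byte-level symbols.

from tokenizers import pre_tokenizers

text = "merhaba dünya 漢字 ひらがな"

# Whitespace keeps the raw characters, so every distinct code point
# becomes part of the trainer's initial alphabet
print(pre_tokenizers.Whitespace().pre_tokenize_str(text))

# ByteLevel remaps each character onto byte-level symbols (256 at most),
# which keeps the initial alphabet small
print(pre_tokenizers.ByteLevel().pre_tokenize_str(text))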

Narsil avatar Jan 19 '22 11:01 Narsil

@Narsil thanks for the answer.

Please try your script with this dataset to reproduce my case.

dataset = load_dataset("oscar", "unshuffled_deduplicated_tr")["train"]

As you mentioned, my dataset contains many Unicode characters since it is Turkish.

ctoraman avatar Jan 19 '22 12:01 ctoraman

Sorry, I can't download that much (36 GB) right now. Could you share your output tokenizer? We could then check my first hypothesis; it will probably be faster.

Narsil avatar Jan 19 '22 13:01 Narsil

tokenizer.zip

(with the first 10k tokens)

ctoraman avatar Jan 19 '22 15:01 ctoraman

The tokenizer seems incomplete, but it does contain every Chinese/Japanese character on its own, so my guess is probably correct: you need a ByteLevel of some kind, because your current alphabet is bigger than what you expect.

If you are training for Turkish, I would suggest removing any non-Turkish Unicode characters altogether if you don't want to use ByteLevel.

One other note: you probably want to remove this Chinese/Japanese data (and probably other non-Turkish data) from your training data to get better speed, because just looking at the tokenizer, your dataset seems to be very multilingual (not a bad thing, but if you think it's Turkish only, that assumption seems to be incorrect).

Narsil avatar Jan 20 '22 08:01 Narsil

I already filtered the dataset for non-Turkish sentences, but I still got thousands of those Chinese/Japanese characters when I used Whitespace.

I got rid of most of them when I used ByteLevel. It also produced a vocab size matching the one I requested.

I have not tried it, but another solution could be using the initial_alphabet and limit_alphabet parameters, for example as sketched below.
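Something along these lines, I suppose (an untested sketch; the Turkish character list and the limit value are only illustrative):

from tokenizers.trainers import WordPieceTrainer

trainer = WordPieceTrainer(
    vocab_size=16700,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    # seed the alphabet with the characters that must be present
    initial_alphabet=list("abcçdefgğhıijklmnoöprsştuüvyz"),
    # cap how many distinct single characters the trainer keeps
    limit_alphabet=200,
)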

Thanks for help.

ctoraman avatar Jan 20 '22 09:01 ctoraman

You could try to filter that data manually by looking at the Unicode script of each character.

https://stackoverflow.com/questions/9868792/find-out-the-unicode-script-of-a-character (First and last answer seem viable IMO.)

Something along the lines of:

all_sentences = []
for sentence in dataset:
    # `unicode_script` is the helper from the Stack Overflow link above;
    # keep a sentence only if every character belongs to an expected script
    # (Turkish is written in the Latin script, "Common" covers digits/punctuation)
    if all(unicode_script(c) in {"Latin", "Common"} for c in sentence):
        all_sentences.append(sentence)

However, this also suggests that the data wasn't properly filtered, doesn't it? (I know OSCAR is web content, but it's supposed to have been curated, no? So why would Chinese characters appear in the Turkish dataset?)

Narsil avatar Jan 20 '22 09:01 Narsil

I do not know why, but I found that OSCAR's Turkish split has many non-Turkish webpages, probably missed by the curators. I found them with a language detector. I have not tried unicode_script as you suggested, but I still need to remove the unwanted characters, because as far as I understand, ByteLevel does not remove them, it only changes their representation.
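For example, with a detector such as the langdetect package (a rough sketch shown only to illustrate the idea, not necessarily the detector that was actually used):

from langdetect import detect

def is_turkish(text):
    try:
        return detect(text) == "tr"  # langdetect returns ISO 639-1 codes
    except Exception:                # detection can fail on empty or odd strings
        return False

# `dataset` here is the OSCAR split loaded earlier in the thread
turkish_only = dataset.filter(lambda example: is_turkish(example["text"]))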

But independent of this dataset issue, if the data contains many Unicode characters, Whitespace does not behave as I would expect (at the very least, I would expect the trainer to trim the vocabulary down to the vocab_size I requested).

ctoraman avatar Jan 20 '22 11:01 ctoraman

On the core issue of whether the vocab should be trimmed or not, I am a little torn. I do tend to sympathize with your expectations.

If a user wants vocab_size=X then the final result should be X or less.

But,

  • Initial alphabet is kind of a core requirement of the algorithm. We can remove part of it by using limit_alphabet:
trainer = trainers.WordPieceTrainer(..., limit_alphabet=1000)

for instance, which you might want to use.

  • but even then, because WordPiece uses a continuing subword prefix by default, getting an "alphabet limit" of 24 actually requires 48 entries ("a", "##a", "b", "##b", ...), so that "abba" can get the ids [0, 3, 3, 1] as the algorithm requires (the first character of a word gets a different token than the same character in the middle of a word); see the short sketch after this list. So the alphabet contributes up to 2 × limit_alphabet entries to the vocabulary (this isn't the case for continuing_subword_prefix="").
  • BPE (which WordPiece uses for training) was originally intended to work without an unk token; the whole goal was to remove unknowns from tokenized strings. That means if we remove part of the initial alphabet, we cannot even tokenize the original dataset (it would crash since there is no dedicated unk token). That would also be rather odd behavior.
  • Finally, you also specify 5 added tokens here ("[SEP]", ...). What if you provided more of them than vocab_size, should we remove some? Should we crash? Should we honor limit_alphabet, vocab_size, or added_tokens, and in which order of priority?
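To make the continuing-subword-prefix point concrete, a small hand-built example (sketch only):

from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# every character needs both a word-initial entry and a "##" word-internal entry
vocab = {"[UNK]": 0, "a": 1, "##a": 2, "b": 3, "##b": 4}
tokenizer = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))

print(tokenizer.encode("abba").tokens)
# ['a', '##b', '##b', '##a']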

All in all, this means the change wouldn't be exactly straightforward, so I won't jump on it. The current behavior keeps point 3 alive, and it still raised your concern (which is good, IMHO). Final note: sentencepiece deals with this in roughly the same way; it uses --character_coverage=0.9995, which is expressed as the percentage of the initial alphabet you want to keep (and crash/unk on the rest).
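For comparison, the sentencepiece equivalent looks roughly like this (a sketch; the input path and model prefix are placeholders):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder path to the training text
    model_prefix="tr_sp",        # placeholder output prefix
    vocab_size=16700,
    character_coverage=0.9995,   # fraction of characters covered; the rest map to <unk>
)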

Can the limit_alphabet option be a reasonable workaround for you in the meantime?

Narsil avatar Jan 20 '22 16:01 Narsil

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Mar 02 '24 01:03 github-actions[bot]