
"the" token is splitted to "t" "h" "e" in large scale corpus

Open guolinke opened this issue 4 years ago • 15 comments

I trained the BPE (BertWordPiece) tokenizer myself, using the RoBERTa 160GB data. However, I found that every "the" is broken. I checked the learned vocab.txt and found that "the" is not in there either.

The parameters used to train the tokenizer:

bpe.train(args.inputs, vocab_size=32768, min_frequency=20)

I guess the count of "the" may overflow on such a large-scale corpus.
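For reference, a rough editorial illustration of that guess (not the library's actual counting logic; it assumes pair counts are stored as wrapping 32-bit unsigned integers):

U32_MAX = 2**32 - 1
print(f"u32 ceiling: {U32_MAX:,}")              # 4,294,967,295

true_count = U32_MAX + 123                      # hypothetical true pair count
wrapped = true_count % 2**32                    # what a wrapping u32 would store
print(f"{true_count:,} stored as {wrapped:,}")  # 4,294,967,418 stored as 122

A pair whose stored count collapses like this would never be chosen for a merge, which would be consistent with "the" (and the merges leading to it) missing from the vocab.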

An example of the broken output:

john . who . like me . is a writer with an interest in t ##h ##e interp ##lay of words and images . assumed i ' d hidden a cryptic message in t ##h ##e random sequence of little pictures . " write a package . spider earth cutting ? "
he asked . i only responded with five more random emoji . " t ##h ##e cheese running bear got caught in a rains ##how ##er looking for diamonds . " he muse ##d .
and thus began a game that has continued to this day .
one of us texts five random emoji to t ##h ##e other . and t ##h ##e recipient has to write a story ( roughly of tweet length ) to explain each of t ##h ##e emoji in order .
below are eight of our collaborative creations . 
james hannah ##am ' s most recent novel . delicious foods . won t ##h ##e pen / faulk ##ner and hurst ##on / wright legacy awards . and was a new york times notable book .
he teaches in t ##h ##e writing program at t ##h ##e pratt institute

guolinke avatar Sep 25 '20 01:09 guolinke

It's possible. I made a temporary branch with u64 everywhere instead of u32:

https://github.com/huggingface/tokenizers/tree/u64_branch

It's a temporary fix, we probably need to think a bit more about how to go about it in the general case.

You can install it from source by going into bindings/python and then running pip install -e .
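One hedged way to double-check that the editable install is the one actually being imported (the exact path and version string will differ on your machine):

import tokenizers

# With an editable install, the package should typically resolve to the
# source checkout (bindings/python/py_src/...) rather than a PyPI wheel.
print(tokenizers.__file__)
print(tokenizers.__version__)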

Narsil avatar Sep 25 '20 09:09 Narsil

Thanks for the quick response. Actually, I added "the" as a special token as a workaround, since other high-frequency tokens (like "and") seem OK.
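A minimal sketch of this workaround, assuming the train() signature with an explicit special_tokens argument (as in the traceback later in this thread); the file list and sizes are illustrative, not the exact script used here:

from tokenizers import BertWordPieceTokenizer

bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=True)
bpe.train(
    ["corpus.txt"],  # assumed training file(s)
    vocab_size=32768,
    min_frequency=20,
    # Force "the" into the vocab by listing it alongside the default special tokens
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the"],
)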

guolinke avatar Sep 25 '20 09:09 guolinke

If "th" and "he" are not tokens either you're probably going to have a subpar tokenizer

Narsil avatar Sep 25 '20 09:09 Narsil

@Narsil Thanks, I just checked, and it seems both "th" and "he" are there. I still think it is better to fix the overflow problem, in case we have an even larger corpus.

guolinke avatar Sep 25 '20 09:09 guolinke

I didn't mean to close.

Narsil avatar Sep 25 '20 09:09 Narsil

It seems the u64 branch cannot fix the problem. The error:

thread '<unnamed>' panicked at 'Trainer should know how to build BPE: MergeTokenOutOfVocabulary("##h##e")', /home/xxx/tokenizers/tokenizers/src/models/bpe/trainer.rs:573:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "train_wp.py", line 55, in <module>
    main()
  File "train_wp.py", line 50, in main
    bpe.train(args.inputs, vocab_size=args.vocab_size, special_tokens=special_tokens,
  File "/home/xxx/tokenizers/bindings/python/py_src/tokenizers/implementations/bert_wordpiece.py", line 117, in train
    self._tokenizer.train(trainer, files)
pyo3_runtime.PanicException: Trainer should know how to build BPE: MergeTokenOutOfVocabulary("##h##e")

guolinke avatar Sep 26 '20 03:09 guolinke

It seems the master branch also fails with the same error. The same code runs fine on 0.8.1:

    from tokenizers import BertWordPieceTokenizer

    bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=True)
    bpe.train(args.inputs)  # args.inputs: list of training files from the surrounding script

Update: it seems this commit (https://github.com/huggingface/tokenizers/commit/95cc8c47ad2fc1af13be406e3bdcdbd2f09fe71e) causes the problem.

guolinke avatar Sep 26 '20 09:09 guolinke

The master branch also seems to be much slower at reading files.

0.8.1

[00:03:13] Reading files (19521 Mo)                 ███████████████████████████████████████████████████████████████████████████████████████████████████████████                 100
[00:00:17] Tokenize words                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:00:14] Count pairs                              ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:00:58] Compute merges                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 31116    /    31116

master

[00:17:46] Reading files (19521 Mo)                 ███████████████████████████████████████████████████████████████████████████████████████████████████████████                 100
[00:00:18] Tokenize words                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:00:15] Count pairs                              ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 6593316  /  6593316
[00:01:13] Compute merges                           ███████████████████████████████████████████████████████████████████████████████████████████████████████████ 31117    /    31117

guolinke avatar Sep 26 '20 10:09 guolinke

So there is definitely a slowdown with master that is expected; it's linked to better (and correct) offset tracking. However, it should not be that bad, so I'm going to try to replicate this. (It does not seem to be the case on smaller files, ~500 Mo, where the slowdown is only ~20%.)

Narsil avatar Sep 28 '20 10:09 Narsil

This is the sort of variance I'm getting:

master:

[00:01:37] Reading files (1048 Mo)                  █████████████████████████████████████████████████████████████████████                 100
[00:00:02] Tokenize words                           █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:09] Count pairs                              █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:10] Compute merges                           █████████████████████████████████████████████████████████████████████ 28375    /    28375

v.0.8.1

[00:01:20] Reading files (1048 Mo)                  █████████████████████████████████████████████████████████████████████                 100
[00:00:03] Tokenize words                           █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:10] Count pairs                              █████████████████████████████████████████████████████████████████████ 1882027  /  1882027
[00:00:11] Compute merges                           █████████████████████████████████████████████████████████████████████ 28374    /    28374

Are you sure your master is up to date? The kind of slowdown you're experiencing looks like a debug-build sort of problem.

Code used:

from tokenizers import BertWordPieceTokenizer

bpe = BertWordPieceTokenizer(clean_text=True, strip_accents=True, lowercase=False)
bpe.train(["out1.txt"])

print(bpe.encode("I love this [SEP]").ids)
bpe.save("tokenizer.json", pretty=True)

Narsil avatar Sep 28 '20 11:09 Narsil

I used this commit (https://github.com/huggingface/tokenizers/commit/36832bfa1292b298c36447d7d2f5a3366c26644c), as the commits after it cannot run. I also set RAYON_RS_NUM_CPUS=16; maybe the multi-threading performance went down?
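For reference, a hedged sketch of how that thread limit can be applied from Python (assuming the variable needs to be set before the Rust side first spins up its thread pool):

import os

# Limit the Rayon thread pool used by the Rust core; set it before importing
# and using the tokenizer so the setting is picked up (the timing is an assumption).
os.environ["RAYON_RS_NUM_CPUS"] = "16"

from tokenizers import BertWordPieceTokenizer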

guolinke avatar Sep 28 '20 13:09 guolinke

Has there been any progress on integrating this into main? We are running into the same issue: with a sufficiently large text corpus, the counts of many two-character tokens overflow and those tokens don't end up in the final vocab.

gpauloski avatar Mar 31 '22 17:03 gpauloski

@gpauloski Not yet; the tentative branch proposed here is super old.

You're welcome to attempt a PR, but making everything generic over both u32 and u64 is not that trivial in Rust (and we need the logic to use u64 only when necessary).

So far the recommendation has been to use something like spm, which can handle u64 for training, and then convert the result to tokenizers (this is what happened for BigScience).
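For anyone taking that route, a minimal hedged sketch of the SentencePiece side (not the BigScience pipeline itself); file names and hyperparameters are assumptions, and the conversion back into a tokenizers object is not shown:

import sentencepiece as spm

# Train a BPE model with SentencePiece, which (per the comment above) can
# handle u64 counts during training. Paths and sizes are illustrative.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_bpe",   # writes spm_bpe.model and spm_bpe.vocab
    vocab_size=32768,
    model_type="bpe",
)

# Quick sanity check of the trained model
sp = spm.SentencePieceProcessor(model_file="spm_bpe.model")
print(sp.encode("the quick brown fox", out_type=str))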

Narsil avatar Apr 01 '22 07:04 Narsil

Thanks, @Narsil! I'll try spm.

gpauloski avatar Apr 01 '22 15:04 gpauloski

Actually, keeping the tokenizer training data small enough that counts stay under u32 was the final route taken for BigScience (but spm with u64 was also used while trying to figure out the best tokenizer).

Narsil avatar Apr 04 '22 07:04 Narsil