Adding many AddedTokens makes loading a tokenizer extremely slow.
Hi!
I'm not sure if this is a problem that can be solved, or one that needs to be solved. Basically, we want to make a kind of hybrid tokenizer, in which we add a large set of whole words to a tokenizer and select these words instead of their subword decompositions whenever they appear.
For example: if we pass the pretokenized string ["dog", "walks", "around", "Paris"], and "Paris" is a whole token, we want to select it instead of decomposing it into subtokens. I think that adding "Paris" as an AddedToken is the right approach for this (but please correct me if I'm wrong).
So we added many of these tokens (about 400k), but this makes loading the tokenizer extremely slow: it takes 15-30 minutes. We now add them as regular tokens instead, which works fine, but has the downside that these whole-word tokens are also found inside other words. For example, "Parisians" will now be turned into ["Paris", "##ians"], which might have a different meaning.
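For illustration, here is a minimal sketch of the whole-word behaviour I'm after, using an AddedToken with single_word=True (that flag and the exact output tokens are assumptions on my side; the result also depends on the tokenizer's vocabulary and normalization):

from tokenizers import AddedToken, Tokenizer

tok = Tokenizer.from_pretrained("baai/bge-base-en-v1.5")

# Add "Paris" so it is only matched as a standalone word, never inside a
# longer word such as "Parisians" (single_word=True is the assumption here).
tok.add_tokens([AddedToken("Paris", single_word=True)])

print(tok.encode("the dog walks around Paris").tokens)  # "Paris" should surface as one token
print(tok.encode("Parisians").tokens)                   # left to the regular WordPiece model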
So my main question is: is there a reason why adding many AddedTokens is slow? Or is this just a path that hasn't been fully optimized yet?
Is using AddedTokens in this way simply wrong? Should we be trying something else?
Thanks! Stéphan
Hey! It depends on which API you are using!
If you are using transformers, this was kind of expected, since adding special and non-special tokens was hard.
If you are using pure tokenizers, one thing is that we have to add new regex match cases for each new token.
If you want a better way, I would recommend adding them as regular tokens + making sure you add the merge rules! This means adding merge paths that fuse these tokens, which can be done automatically (see the sketch below). If that is of interest to you, provide me a reproducer with a model on the Hub and I can help!
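Roughly what I mean, as a hedged sketch that edits tokenizer.json directly (the left-to-right merge construction and the file names are my simplification, not the library's own logic; newer files store merges as ["a", "b"] pairs rather than "a b" strings):

import json

with open("tokenizer.json") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]    # token -> id
merges = data["model"]["merges"]  # ordered merge rules

def add_whole_word(pieces):
    # pieces: how the current BPE tokenizer splits the word, e.g. ["Par", "is"]
    left = pieces[0]
    for right in pieces[1:]:
        fused = left + right
        if fused not in vocab:
            vocab[fused] = len(vocab)  # assumes ids are dense 0..len(vocab)-1
        merges.append(f"{left} {right}")  # assumes the older "a b" string format
        left = fused

add_whole_word(["Par", "is"])

with open("tokenizer_with_merges.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)

Appending the new merges after the existing ones keeps them at the lowest priority, so they only fuse pieces that the original merges already produce.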
Hey @ArthurZucker , thanks for your response!
I'm using the pure tokenizers API. However, I am using a WordPiece tokenizer (actually just the baai/bge-base-en-v1.5 tokenizer, which AFAIK is just the original BERT tokenizer), not a BPE tokenizer. I see how adding merges to a BPE tokenizer could lead to a good solution though, so that's a cool idea.
My vocabulary is a list of 400k tokens (just the vocabulary of the GloVe vectors). Assuming vocab is a list of 400k strings, this already takes a lot of time:
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("baai/bge-base-en-v1.5")
tok.add_tokens(vocab)  # vocab: a list of ~400k strings (the GloVe vocabulary)
This wouldn't really matter on its own, but the cost is incurred every time the tokenizer is loaded from disk, which makes using it prohibitively expensive. I could maybe convert it to BPE, but I'm not sure if that makes sense.
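Concretely, the pattern that hurts looks like this (the file name and the timing code are just for illustration; vocab is the 400k-word list from above):

import time
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("baai/bge-base-en-v1.5")
tok.add_tokens(vocab)              # slow once, which would be acceptable
tok.save("hybrid_tokenizer.json")

start = time.perf_counter()
tok = Tokenizer.from_file("hybrid_tokenizer.json")  # slow again on every single load
print(f"loading took {time.perf_counter() - start:.1f}s")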
I'll upload the resulting tokenizer once it's done, and post another comment.
Thanks!
Here you go: https://huggingface.co/stephantulkens/large_tokenizer/tree/main
Hi @ArthurZucker
I'm also struggling with this issue. I'm adding 100k+ tokens manually (not by training) to my BPE tokenizer using the raw tokenizers library. Can you show me how to add them as regular tokens instead of added tokens?
If I use added tokens, just loading the tokenizer from disk takes about 3 minutes, which is very inconvenient.
Noting both comments, will see if I can do something!
Update: the MAIN issue when loading the added vocab is that we have to initialize the Aho-Corasick automaton for the regex pattern. Removing the normalization from it gives me this:
Finished `bench` profile [optimized] target(s) in 27.38s
Running benches/added_vocab_deserialize.rs (/Users/arthurzucker/Work/tokenizerMe/tokenizers/target/release/deps/added_vocab_deserialize-f039aa3f75160f32)
Gnuplot not found, using plotters backend
deserialize_added_vocab_10000
time: [6.4288 ms 6.4710 ms 6.5190 ms]
change: [-88.343% -88.195% -88.047%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
11 (11.00%) high mild
2 (2.00%) high severe
Benchmarking deserialize_added_vocab_100000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.3s, or reduce sample count to 60.
deserialize_added_vocab_100000
time: [72.712 ms 73.058 ms 73.504 ms]
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe
Benchmarking deserialize_added_vocab_400000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 35.1s, or reduce sample count to 10.
deserialize_added_vocab_400000
time: [345.91 ms 348.41 ms 351.27 ms]
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) low mild
4 (4.00%) high mild
4 (4.00%) high severe
Running benches/bert_benchmark.rs (/Users/arthurzucker/Work/tokenizerMe/tokenizers/target/release/deps/bert_benchmark-f29cc41b45854ef9)
Gnuplot not found, using plotters backend
I will see if I can make it better, for example by doing a single normalization pass and then splitting the tokens on something.
Nice! In the meantime we've just added the tokens as regular tokens, which is a lot faster and also kind of works (but requires manually editing the JSON 😆), so super thanks for picking this up.
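For reference, the manual edit is roughly this (a sketch assuming a WordPiece model section in tokenizer.json; new_words stands for whatever list of whole words you want to add):

import json

with open("tokenizer.json") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]  # token -> id mapping of the WordPiece model
for word in new_words:
    if word not in vocab:
        vocab[word] = len(vocab)

with open("tokenizer_regular_tokens.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)

As noted above, tokens added this way can also match inside longer words, so it is a trade-off rather than a drop-in replacement for AddedTokens.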
Haha, happy to hear it! #1782 to follow it!
~Okay, it's fixed!~ More optimized, I would say, haha
@ArthurZucker this is an absolute godsend. I have a BPE tokenizer augmented with 131k tokens from an audio codec and it was taking ~5 minutes to load, every time I wanted to test something. I pulled down this PR and now it takes 4 seconds. This fix will give me hours of my life back 😆 Hope it gets merged soon!