Adding many AddedTokens makes loading a tokenizer extremely slow.
Hi!
I'm not sure if this is a problem that can be solved, or one that needs to be solved. Basically, we want to make a kind of hybrid tokenizer, in which we add a large set of whole words to a tokenizer and select these words instead of their subword decompositions whenever they appear.
For example: if we pass the pretokenized string ["dog", "walks", "around", "Paris"], and "Paris" is a whole token, we want to select it instead of decomposing it into subtokens. I think that adding "Paris" as an AddedToken is the right approach for this (but please correct me if I'm wrong).
So we added many of these tokens (about 400k), but this makes loading the tokenizer extremely slow: it takes 15-30 minutes. We now add them as regular tokens instead, which works fine, but has the downside that these whole-word tokens are also found inside other words. For example, "Parisians" will now be turned into ["Paris", "##ians"], which might have a different meaning.
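For illustration, here is a minimal sketch of the whole-word behaviour I'm after, using an AddedToken with single_word=True (that flag and the exact output tokens are assumptions on my side; the result also depends on the tokenizer's vocabulary and normalization):

from tokenizers import AddedToken, Tokenizer

tok = Tokenizer.from_pretrained("baai/bge-base-en-v1.5")

# Add "Paris" so it is only matched as a standalone word, never inside a
# longer word such as "Parisians" (single_word=True is the assumption here).
tok.add_tokens([AddedToken("Paris", single_word=True)])

print(tok.encode("the dog walks around Paris").tokens)  # "Paris" should surface as one token
print(tok.encode("Parisians").tokens)                   # left to the regular WordPiece model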
So my main question is: is there a reason why adding many AddedTokens is slow? Or is this just a path that hasn't been fully optimized yet?
Is using AddedTokens in this way simply wrong? Should we be trying something else?
Thanks! Stéphan
Hey! It depends on which API you are using!
If you are using transformers, this was kind of expected, since adding special and non-special tokens was hard.
If you are using pure tokenizers, one thing is that we have to add new regex match cases for each new token.
If you want a better way, I would recommend adding them as regular tokens + making sure you add the merge rules! This means adding merge paths that fuse these tokens, which can be done automatically (see the sketch below). If that is of interest to you, provide me a reproducer with a model on the Hub and I can help!
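Roughly what I mean, as a hedged sketch that edits tokenizer.json directly (the left-to-right merge construction and the file names are my simplification, not the library's own logic; newer files store merges as ["a", "b"] pairs rather than "a b" strings):

import json

with open("tokenizer.json") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]    # token -> id
merges = data["model"]["merges"]  # ordered merge rules

def add_whole_word(pieces):
    # pieces: how the current BPE tokenizer splits the word, e.g. ["Par", "is"]
    left = pieces[0]
    for right in pieces[1:]:
        fused = left + right
        if fused not in vocab:
            vocab[fused] = len(vocab)  # assumes ids are dense 0..len(vocab)-1
        merges.append(f"{left} {right}")  # assumes the older "a b" string format
        left = fused

add_whole_word(["Par", "is"])

with open("tokenizer_with_merges.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)

Appending the new merges after the existing ones keeps them at the lowest priority, so they only fuse pieces that the original merges already produce.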
Hey @ArthurZucker , thanks for your response!
I'm using the pure tokenizers API. However, I am using a WordPiece tokenizer (actually just the baai/bge-base-en-v1.5 tokenizer, which AFAIK is just the original BERT tokenizer), not a BPE tokenizer. I see how adding merges to a BPE tokenizer could lead to a good solution though, so that's a cool idea.
My vocabulary is a list of 400k tokens (just the vocabulary of the GloVe vectors). Assuming vocab is a list of 400k strings, this already takes a lot of time:
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("baai/bge-base-en-v1.5")
tok.add_tokens(vocab)  # vocab: a list of ~400k strings (the GloVe vocabulary)
This wouldn't really matter on its own, but the cost is incurred every time the tokenizer is loaded from disk, which makes using it prohibitively expensive. I could maybe convert it to BPE, but I'm not sure if that makes sense.
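Concretely, the pattern that hurts looks like this (the file name and the timing code are just for illustration; vocab is the 400k-word list from above):

import time
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("baai/bge-base-en-v1.5")
tok.add_tokens(vocab)              # slow once, which would be acceptable
tok.save("hybrid_tokenizer.json")

start = time.perf_counter()
tok = Tokenizer.from_file("hybrid_tokenizer.json")  # slow again on every single load
print(f"loading took {time.perf_counter() - start:.1f}s")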
I'll upload the resulting tokenizer once it's done, and post another comment.
Thanks!
Here you go: https://huggingface.co/stephantulkens/large_tokenizer/tree/main
Hi @ArthurZucker
I'm also struggling with this issue. I'm adding 100k+ tokens manually (not by training) to my BPE tokenizer using the raw tokenizers library. Can you show me how to add them as regular tokens instead of added tokens?
If I use added tokens, just loading the tokenizer from disk takes about 3 minutes, which is very inconvenient.
Noting both comments, will see if I can do something!
Update: the MAIN issue when loading the added vocab is that we have to initialize the Aho-Corasick automaton for the regex pattern. Removing the normalization from it gives me this:
Finished `bench` profile [optimized] target(s) in 27.38s
Running benches/added_vocab_deserialize.rs (/Users/arthurzucker/Work/tokenizerMe/tokenizers/target/release/deps/added_vocab_deserialize-f039aa3f75160f32)
Gnuplot not found, using plotters backend
deserialize_added_vocab_10000
time: [6.4288 ms 6.4710 ms 6.5190 ms]
change: [-88.343% -88.195% -88.047%] (p = 0.00 < 0.05)
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
11 (11.00%) high mild
2 (2.00%) high severe
Benchmarking deserialize_added_vocab_100000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.3s, or reduce sample count to 60.
deserialize_added_vocab_100000
time: [72.712 ms 73.058 ms 73.504 ms]
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe
Benchmarking deserialize_added_vocab_400000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 35.1s, or reduce sample count to 10.
deserialize_added_vocab_400000
time: [345.91 ms 348.41 ms 351.27 ms]
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) low mild
4 (4.00%) high mild
4 (4.00%) high severe
Running benches/bert_benchmark.rs (/Users/arthurzucker/Work/tokenizerMe/tokenizers/target/release/deps/bert_benchmark-f29cc41b45854ef9)
Gnuplot not found, using plotters backend
I will see if I can make it better, for example by doing a single normalization pass and then splitting the tokens on something.
Nice! In the meantime we've just added the tokens as regular tokens, which is a lot faster and also kind of works (but requires manually editing the JSON 😆), so super thanks for picking this up.
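For reference, the manual edit is roughly this (a sketch assuming a WordPiece model section in tokenizer.json; new_words stands for whatever list of whole words you want to add):

import json

with open("tokenizer.json") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]  # token -> id mapping of the WordPiece model
for word in new_words:
    if word not in vocab:
        vocab[word] = len(vocab)

with open("tokenizer_regular_tokens.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)

As noted above, tokens added this way can also match inside longer words, so it is a trade-off rather than a drop-in replacement for AddedTokens.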
Haha, happy to hear it! #1782 to follow it!
~Okay, it's fixed!~ More optimized, I would say, haha
@ArthurZucker this is an absolute godsend. I have a BPE tokenizer augmented with 131k tokens from an audio codec and it was taking ~5 minutes to load, every time I wanted to test something. I pulled down this PR and now it takes 4 seconds. This fix will give me hours of my life back 😆 Hope it gets merged soon!