transformers
Slow decoding with many special tokens in vocabulary
System Info
present across multiple versions
Who can help?
@younesbelkada
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
from transformers import T5Tokenizer
from time import time
from random import randint

# Default slow tokenizer: 100 extra_ids (sentinel/special tokens)
t1 = T5Tokenizer.from_pretrained('t5-base')
# Same tokenizer but with 2000 extra_ids, i.e. many more special tokens
t2 = T5Tokenizer.from_pretrained('t5-base', extra_ids=2000)

# 10,000 random token ids to decode
to_decode = [randint(0, 32000) for i in range(10000)]

start = time()
t1.decode(to_decode)
print("few special tokens:", time() - start)

start = time()
t2.decode(to_decode)
print("many special tokens:", time() - start)
Expected behavior
The slowdown should not be so drastic. The cause is an inefficient implementation of all_special_ids and all_special_tokens. Additionally, generating them on the fly incurs a large overhead, since these attributes are queried for every id to be decoded (here and here).
Slow tokenizers are... slow. That's why we wrote the tokenizers library ;-) Why not use T5TokenizerFast, which doesn't have the same problem?
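For reference, a sketch of the suggested workaround, timed the same way as the reproduction above (assuming extra_ids can be passed through from_pretrained for the fast tokenizer as in the slow case):

from transformers import T5TokenizerFast
from time import time
from random import randint

# Assumption: extra_ids is accepted here just like for the slow tokenizer
t_fast = T5TokenizerFast.from_pretrained('t5-base', extra_ids=2000)
to_decode = [randint(0, 32000) for i in range(10000)]

start = time()
t_fast.decode(to_decode)
print("fast tokenizer, many special tokens:", time() - start)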
T5TokenizerFast does not have byte-fallback, and why artificially handicap the slow tokenizer if it could be more efficient (using sets instead of lists and computing the attribute only when it is updated)?
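A rough sketch of what that could look like (hypothetical class and method names, not a patch against the library):

class _SpecialIdCache:
    """Keeps special token ids in a set and rebuilds it only on updates."""

    def __init__(self, special_ids):
        self._special_ids = set(special_ids)

    def add_special_ids(self, new_ids):
        # The cache is refreshed here, when the special tokens change,
        # not on every decode call.
        self._special_ids.update(new_ids)

    def is_special(self, token_id):
        # O(1) average-case lookup during decoding.
        return token_id in self._special_ids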
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.