
Slow decoding with many special tokens in vocabulary

Open samsontmr opened this issue 2 years ago • 3 comments

System Info

present across multiple versions

Who can help?

@younesbelkada

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

from transformers import T5Tokenizer
from time import time
from random import randint

# Baseline: default T5 tokenizer (100 <extra_id_*> special tokens).
t1 = T5Tokenizer.from_pretrained('t5-base')
# Same checkpoint, but with 2000 extra special tokens in the vocabulary.
t2 = T5Tokenizer.from_pretrained('t5-base', extra_ids=2000)
# 10000 random token ids to decode.
to_decode = [randint(0, 32000) for i in range(10000)]

start = time()
t1.decode(to_decode)
print("few special tokens:", time() - start)

start = time()
t2.decode(to_decode)
print("many special tokens:", time() - start)

Expected behavior

The slowdown should not be this drastic. The cause is an inefficient implementation of all_special_ids and all_special_tokens: they are regenerated as lists on the fly, and since the attribute is queried for every id being decoded (here and here), each decoded id pays a linear scan over all special tokens.

samsontmr avatar Feb 22 '23 22:02 samsontmr
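To make the reported cause concrete, here is a simplified, self-contained illustration of the pattern described in the comment above. It is not the actual transformers code: the id ranges are invented, and the list rebuild stands in for the per-access recomputation of the property.

from timeit import timeit

# Illustrative numbers only: 2000 special-token ids, 10000 ids to decode.
special_ids = list(range(32000, 34000))
ids_to_decode = list(range(10000))

def decode_rebuilding_list():
    # Pattern described above: the list of special ids is regenerated and
    # linearly scanned once per decoded id.
    for token_id in ids_to_decode:
        all_special_ids = list(special_ids)      # rebuilt on every access
        _ = token_id in all_special_ids          # O(len(special_ids)) membership test

def decode_with_cached_set():
    # Alternative: build a set once and reuse it for every id.
    cached_ids = set(special_ids)
    for token_id in ids_to_decode:
        _ = token_id in cached_ids               # O(1) membership test

print("rebuilt list:", timeit(decode_rebuilding_list, number=1))
print("cached set  :", timeit(decode_with_cached_set, number=1))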

Slow tokenizers are... slow. That's why we wrote the tokenizers library ;-) Why not use T5TokenizerFast, which doesn't have the same problem?

sgugger avatar Feb 23 '23 09:02 sgugger
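For reference, the suggested swap applied to the reproduction script would look roughly like the sketch below; whether it is acceptable depends on the byte-fallback point raised in the next comment.

from random import randint
from transformers import T5TokenizerFast

# Rust-backed tokenizer for the same checkpoint; its decode path does not
# rebuild the special-token list per id, so the slowdown reported above
# should not appear.
fast_tokenizer = T5TokenizerFast.from_pretrained('t5-base')
fast_tokenizer.decode([randint(0, 32000) for i in range(10000)])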

T5TokenizerFast does not have byte fallback. Also, why artificially handicap the slow tokenizer when it could be more efficient (by using sets instead of lists and recomputing the attribute only when it is updated)?

samsontmr avatar Feb 23 '23 21:02 samsontmr
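A minimal sketch of the caching proposed above, written as a hypothetical subclass: the class name and cache attribute are invented for illustration, and this is not an actual transformers patch.

from transformers import T5Tokenizer

class CachedSpecialIdsT5Tokenizer(T5Tokenizer):
    """Hypothetical subclass: compute all_special_ids once, keep it as a set,
    and recompute only after the special tokens change."""

    def __init__(self, *args, **kwargs):
        self._special_ids_cache = None
        super().__init__(*args, **kwargs)

    @property
    def all_special_ids(self):
        # Build the set lazily on first use; later membership tests are O(1).
        # Note: returning a set instead of a list speeds up lookups, but any
        # caller relying on list semantics would need adjusting.
        if self._special_ids_cache is None:
            self._special_ids_cache = set(super().all_special_ids)
        return self._special_ids_cache

    def add_special_tokens(self, *args, **kwargs):
        # Invalidate the cache whenever special tokens are added. A real fix
        # would also cover the other code paths that can change them
        # (e.g. add_tokens).
        self._special_ids_cache = None
        return super().add_special_tokens(*args, **kwargs)

# Usage, mirroring the reproduction above:
# tokenizer = CachedSpecialIdsT5Tokenizer.from_pretrained('t5-base', extra_ids=2000)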

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 25 '23 15:03 github-actions[bot]