transformers
Slow decoding with many special tokens in vocabulary
System Info
present across multiple versions
Who can help?
@younesbelkada
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
from transformers import T5Tokenizer
from time import time
from random import randint

# Default slow tokenizer: 100 extra_ids (sentinel/special tokens)
t1 = T5Tokenizer.from_pretrained('t5-base')
# Same tokenizer but with 2000 extra_ids, i.e. many more special tokens
t2 = T5Tokenizer.from_pretrained('t5-base', extra_ids=2000)

# 10,000 random token ids to decode
to_decode = [randint(0, 32000) for i in range(10000)]

start = time()
t1.decode(to_decode)
print("few special tokens:", time() - start)

start = time()
t2.decode(to_decode)
print("many special tokens:", time() - start)
Expected behavior
The slowdown should not be so drastic. The cause is an inefficient implementation of all_special_ids and all_special_tokens. Additionally, generating them on the fly incurs a large overhead, since these attributes are queried for every id to be decoded (here and here).
Slow tokenizers are... slow. That's why we wrote the tokenizers library ;-) Why not use T5TokenizerFast, which doesn't have the same problem?
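For reference, a sketch of the suggested workaround, timed the same way as the reproduction above (assuming extra_ids can be passed through from_pretrained for the fast tokenizer as in the slow case):

from transformers import T5TokenizerFast
from time import time
from random import randint

# Assumption: extra_ids is accepted here just like for the slow tokenizer
t_fast = T5TokenizerFast.from_pretrained('t5-base', extra_ids=2000)
to_decode = [randint(0, 32000) for i in range(10000)]

start = time()
t_fast.decode(to_decode)
print("fast tokenizer, many special tokens:", time() - start)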
T5TokenizerFast does not have byte-fallback, and why artificially handicap the slow tokenizer if it could be more efficient (using sets instead of lists and computing the attribute only when it is updated)?
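A rough sketch of what that could look like (hypothetical class and method names, not a patch against the library):

class _SpecialIdCache:
    """Keeps special token ids in a set and rebuilds it only on updates."""

    def __init__(self, special_ids):
        self._special_ids = set(special_ids)

    def add_special_ids(self, new_ids):
        # The cache is refreshed here, when the special tokens change,
        # not on every decode call.
        self._special_ids.update(new_ids)

    def is_special(self, token_id):
        # O(1) average-case lookup during decoding.
        return token_id in self._special_ids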
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.