Why is the tokenizer slower than tiktoken?
Hi, I tried both the HF GPT2 tokenizer and tiktoken, and I found tiktoken is faster than HF. Could you explain why this might happen?
Hey, could you share a reproducer? Some of the difference comes from the fact that we keep track of offsets and a lot of other information, which tiktoken does not. But we could do that work only when asked for and potentially improve speed.
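To make that concrete, here is a minimal sketch of the kind of bookkeeping meant here (using the fast GPT2 tokenizer, which can return the character offsets of every token; tiktoken does not track these):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Fast tokenizers can report the (start, end) character span of each token.
encoded = tokenizer("Hello world", return_offsets_mapping=True)
print(encoded["input_ids"])       # token ids
print(encoded["offset_mapping"])  # per-token character offsets
```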
It's high on my priority list to run benchmarks and improve our code if needed!
For HF, we use:

```python
import time

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
print(end - start)  # time for a single encode call
```
For tiktoken, we just initialize the tokenizer via tiktoken; everything else is the same:

```python
import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt2")
```
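In full, assuming the same timing pattern as above:

```python
import time

import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt2")
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
print(end - start)
```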
Please let me know if you need any other information.
You are using GPT2Tokenizer, which is the slow, pure-Python one. Use GPT2TokenizerFast 😅
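With the same measurement as above, the swap is just (a minimal sketch):

```python
import time

from transformers import GPT2TokenizerFast

# The fast tokenizer is backed by the Rust `tokenizers` library.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
print(end - start)
```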
We actually dug into this a bit:
- Rayon parallelism is kinda broken (see the note below on disabling it when benchmarking)
- we have contention on the cache for GPT2
- we have memory allocations that are also slowing things down

With #1560, I was able to get performance similar to tiktoken, stay posted 😉
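Side note: if you want to isolate the effect of the Rayon thread pool in your own benchmarks, you can disable it via the environment variable the Rust backend reads:

```python
import os

# Must be set before the tokenizer does any parallel work;
# "false" turns off the Rayon thread pool in the Rust backend.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.encode("hello world"))
```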
One thing though: tiktoken forces the splitting of very long sequences. If you split them into a batch yourself, you are already going to get quite a lot better performance.
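For example (a rough sketch: the chunk size is arbitrary, and splitting on raw character boundaries can change the tokens right at the chunk edges):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

long_text = "some very long document " * 100_000
chunk_size = 10_000  # characters per chunk, arbitrary choice for illustration
chunks = [long_text[i:i + chunk_size] for i in range(0, len(long_text), chunk_size)]

# Encoding a list of chunks lets the backend parallelize across them,
# instead of pushing one huge sequence through the whole pipeline.
encodings = tokenizer(chunks)
```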