
Why is the tokenizer slower than tiktoken?

Open BigBinnie opened this issue 9 months ago • 8 comments

Hi, I tried using the HF GPT2 tokenizer and tiktoken, and found that tiktoken is faster than HF. Could you explain why this might happen?

[Screenshot of the timing comparison, 2024-04-29]

BigBinnie avatar Apr 29 '24 23:04 BigBinnie

Hey, could you share a reproducer? Part of it is that we keep track of offsets and a lot of other information, which tiktoken does not. But we could do that only when asked, and potentially improve speed.
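
For illustration, a minimal sketch of the offset bookkeeping the HF fast tokenizer exposes (tiktoken's encode returns only token ids):

import time

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# return_offsets_mapping asks for character offsets per token,
# extra bookkeeping that tiktoken does not compute
encoded = tokenizer("Hello world", return_offsets_mapping=True)
print(encoded["input_ids"])       # token ids
print(encoded["offset_mapping"])  # list of (start, end) character spans, one per token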

ArthurZucker avatar Apr 30 '24 10:04 ArthurZucker

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar May 31 '24 01:05 github-actions[bot]

It's high on my priority list to do benchmarks and improve our code if needed!

ArthurZucker avatar Jun 05 '24 07:06 ArthurZucker

For HF, we use

import time

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()

For tiktoken, we just initialize the tokenizer with tiktoken; everything else is the same:

import tiktoken
tokenizer = tiktoken.encoding_for_model("gpt2")

Please let me know if you need any other information.

BigBinnie avatar Jun 20 '24 22:06 BigBinnie

You are using GPT2Tokenizer, which is the slow (pure-Python) one. Use GPT2TokenizerFast 😅
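
For reference, a minimal sketch of the same benchmark with the fast, Rust-backed tokenizer; the text and timing setup are assumed to match the earlier snippet:

import time

from transformers import GPT2TokenizerFast

# GPT2TokenizerFast wraps the Rust tokenizers backend instead of the pure-Python implementation
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
print(end - start)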

ArthurZucker avatar Jun 21 '24 08:06 ArthurZucker

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jul 22 '24 01:07 github-actions[bot]

We actually dug into it a bit:

  1. Rayon parallelism is kind of broken
  2. we have concurrency on the cache for GPT2
  3. we have memory allocations that are also slowing things down

With #1560, I was able to get performance similar to tiktoken, keep posted 😉

ArthurZucker avatar Jul 31 '24 13:07 ArthurZucker

One thing, though: tiktoken forces the split of very long sequences. If you split them into a batch yourself, you will already get much better performance; see the sketch below.
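
For illustration, a rough sketch of chunking a long text before encoding; the chunk size and the character-based splitting are assumptions for the example, not what tiktoken does internally:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

long_text = "xxx " * 100_000
chunk_size = 10_000  # arbitrary chunk length in characters

# Split the long input into smaller pieces and encode them as a batch,
# which lets the backend parallelize across chunks.
# Note: naive character splits can cut a token at a chunk boundary.
chunks = [long_text[i:i + chunk_size] for i in range(0, len(long_text), chunk_size)]
encoded_batch = tokenizer(chunks)["input_ids"]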

ArthurZucker avatar Jul 31 '24 13:07 ArthurZucker