Why is the tokenizer slower than tiktoken?
Hi, I tried both the HF GPT2 tokenizer and tiktoken, and I found tiktoken is faster than HF. Could you explain why this might happen?
Hey, could you share a reproducer? Some of the difference comes from the fact that we keep track of offsets and a lot of other information, which tiktoken does not. But we could do that work only when asked for and potentially improve speed.
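To make that concrete, here is a minimal sketch of the kind of bookkeeping meant here (using the fast GPT2 tokenizer, which can return the character offsets of every token; tiktoken does not track these):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Fast tokenizers can report the (start, end) character span of each token.
encoded = tokenizer("Hello world", return_offsets_mapping=True)
print(encoded["input_ids"])       # token ids
print(encoded["offset_mapping"])  # per-token character offsets
```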
It's high on my priority list to run benchmarks and improve our code if needed!
For HF, we use:

```python
import time

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
print(end - start)  # time for a single encode call
```
For tiktoken, we just initialize the tokenizer via tiktoken; everything else is the same:

```python
import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt2")
```
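In full, assuming the same timing pattern as above:

```python
import time

import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt2")
text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
print(end - start)
```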
Please let me know if you need any other information.
You are using GPT2Tokenizer, which is the slow, pure-Python one. Use GPT2TokenizerFast 😅
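With the same measurement as above, the swap is just (a minimal sketch):

```python
import time

from transformers import GPT2TokenizerFast

# The fast tokenizer is backed by the Rust `tokenizers` library.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "xxx"
start = time.time()
encoded_input = tokenizer.encode(text)
end = time.time()
print(end - start)
```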
We actually dug into this a bit:
- Rayon parallelism is kinda broken (see the note below on disabling it when benchmarking)
- we have contention on the cache for GPT2
- we have memory allocations that are also slowing things down

With #1560, I was able to get performance similar to tiktoken, stay posted 😉
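Side note: if you want to isolate the effect of the Rayon thread pool in your own benchmarks, you can disable it via the environment variable the Rust backend reads:

```python
import os

# Must be set before the tokenizer does any parallel work;
# "false" turns off the Rayon thread pool in the Rust backend.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer.encode("hello world"))
```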
One thing though: tiktoken forces the splitting of very long sequences. If you split them into a batch yourself, you are already going to get quite a lot better performance.
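For example (a rough sketch: the chunk size is arbitrary, and splitting on raw character boundaries can change the tokens right at the chunk edges):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

long_text = "some very long document " * 100_000
chunk_size = 10_000  # characters per chunk, arbitrary choice for illustration
chunks = [long_text[i:i + chunk_size] for i in range(0, len(long_text), chunk_size)]

# Encoding a list of chunks lets the backend parallelize across them,
# instead of pushing one huge sequence through the whole pipeline.
encodings = tokenizer(chunks)
```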