Tokenizer encode very slow
Hi All,
I have trained a tokenizer on my own dataset, which consists of files with 50,000 lines of about 5,000 tokens each. Training is fast: all cores are utilised and it finishes in around 30 minutes for my dataset. However, encoding is really slow, whether I encode single sentences or a whole batch. The following code finishes in 30 seconds for a file of 50,000 lines:
from tokenizers import Tokenizer

t = Tokenizer.from_file('my-trained-tokenizer.json')
t.enable_padding(length=512)
t.enable_truncation(max_length=512)

# opening the file and reading the lines into memory takes 250 ms
with open('my-file.txt') as f:
    lines = [f'[start]{l}' for l in f]

# encode_batch on the inputs takes 30 seconds for 50,000 lines
inputs = t.encode_batch(lines)

# looping and encoding one by one takes several minutes
for l in lines:
    ids = t.encode(l).ids
For context: opening the file, reading the lines into memory, looping in Python over every line, splitting on spaces, and replacing every token with an int via a dict lookup finishes in 6 seconds.
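For reference, a minimal sketch of that pure-Python baseline would look roughly like this (assuming a plain dict vocab mapping token strings to ids; the fallback id 0 is just a placeholder for unknown tokens):

# hypothetical pure-Python baseline: whitespace split + dict lookup per token
with open('my-file.txt') as f:
    ids_per_line = [[vocab.get(tok, 0) for tok in line.split(' ')] for line in f]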
Are the speeds mentioned above normal? Trying to tokenize on the fly with TensorFlow datasets is hopeless currently, since my GPUs sit at 2% utilisation. Do I need to save the dataset in tokenized form? That is also a costly process, since it needs to be performed daily for a lot of data.
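If it helps, the tokenized form I have in mind would be something like the sketch below (file names are placeholders, and numpy is just one possible storage format):

import numpy as np
from tokenizers import Tokenizer

t = Tokenizer.from_file('my-trained-tokenizer.json')
t.enable_padding(length=512)
t.enable_truncation(max_length=512)

with open('my-file.txt') as f:
    lines = [f'[start]{l}' for l in f]

# encode once, keep only the padded/truncated ids, and cache them on disk
ids = np.array([e.ids for e in t.encode_batch(lines)], dtype=np.int32)
np.save('my-file.tokenized.npy', ids)  # later: tf.data.Dataset.from_tensor_slices(np.load(...))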
Similar observation here. I trained a tokenizer (Tokenizer(vocabulary_size=64000, model=SentencePieceBPE, unk_token=<unk>, replacement=▁, add_prefix_space=True, dropout=None)) on a dataset of 200k records. I want to use the tokenizer in a scikit-learn pipeline, so I'm only interested in the resulting tokens of a text:
def huggingface_tokenize(text, tokenizer):
    return tokenizer.encode(text).tokens
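For example, the intended use is to plug this callable straight into the vectorizer, roughly like this (a sketch; recent sklearn versions may warn that token_pattern is ignored when a custom tokenizer is passed):

from functools import partial
from sklearn.feature_extraction.text import CountVectorizer

# use the trained subword tokenizer in place of sklearn's default word tokenizer
vectorizer = CountVectorizer(
    tokenizer=partial(huggingface_tokenize, tokenizer=tokenizer),
    lowercase=False,  # the tokenizer handles its own normalization
)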
Compared to a standard sklearn CountVectorizer, the tokenization of an individual text is ~ 13x slower. See benchmark:
%timeit huggingface_tokenize(text, tokenizer)
>>> 25.6 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit sklearn_tokenize(text)
>>> 1.87 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Is this to be expected or is there a way to speed this up?
Did you install tokenizers from source, with pip install -e .?
Currently installing that way will work, but Rust is built in debug mode rather than release mode, which makes a huge difference. If you want to fix it right now, you just need to change
rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3)]
into
rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3, debug=False)],
We'll be updating that soon so that we can't shoot ourselves in the foot in the future.
Otherwise, could either of you share the files and code necessary to reproduce? I'd be happy to find the bottlenecks and maybe fix them. That would be a huge help, thanks.
I installed tokenizers from PyPI, not from source.
Unfortunately, I cannot share the data but here's a slightly different reproducible example:
import pandas as pd
from tokenizers import SentencePieceBPETokenizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
data = fetch_20newsgroups()
vec = CountVectorizer()
vec.fit(data.data)
text = data.data[0]
sklearn_tokenize = vec.build_tokenizer()
%timeit sklearn_tokenize(text)
>>> 40.5 µs ± 454 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
data_path = 'tokenize_benchmark.txt'
pd.Series(data.data).to_csv(data_path, header=False, index=False)
tokenizer = SentencePieceBPETokenizer()
tokenizer.train([data_path], vocab_size=16_000, min_frequency=2, limit_alphabet=1000)
def huggingface_tokenize(text, tokenizer):
    return tokenizer.encode(text).tokens
%timeit huggingface_tokenize(text, tokenizer)
>>> 228 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I've been lurking on this through mail notifications. I think part of the difference might be due to accessing the tokens attribute on the encoded tokens. This copies all tokens twice:
https://github.com/huggingface/tokenizers/blob/62c3d40f1129e055438fb81c118797182560c72b/bindings/python/src/encoding.rs#L89-L92
First, to_vec() allocates a new String for each token in the Encoding and allocates a new Vec with at least n_tokens capacity. Then a PyList is allocated, and each of the tokens from the new Vec is first converted to a PyString and then pushed to the PyList. It might even be that the PyList grows a few times and gets re-allocated; I'm currently not sure about the implementation of the Vec -> PyList conversion.
There might actually be an implementation for &[String] -> &PyList which could save at least the intermediate step in Rust. I don't think the conversion to PyList can be optimized a lot, tho.
edit: The implementation exists, that might cut some inefficiency.
https://github.com/PyO3/pyo3/blob/a0960f891801c0534856cb90fa90451828579470/src/types/list.rs#L165-L179
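One quick way to check how much of the per-call cost is this conversion (assuming each attribute access re-runs it, which the getter above suggests) is to time the pieces separately:

enc = tokenizer.encode(text)    # encode once

%timeit tokenizer.encode(text)  # cost of the BPE encoding itself
%timeit enc.tokens              # cost of converting the token strings to a Python list
%timeit enc.ids                 # ids are plain ints, so this conversion should be cheaper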
The other question is: what does sklearn_tokenize() do? Is it just whitespace tokenizing? If so, then you're comparing very different methods and algorithms, and it wouldn't be surprising at all that it's much faster.
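(For reference, if sklearn_tokenize comes from CountVectorizer's default build_tokenizer(), it is essentially a single regex findall, roughly:)

import re

# CountVectorizer's default token_pattern: runs of 2+ word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def sklearn_like_tokenize(text):
    # no vocabulary lookup, no subword merges, just regex word splitting
    return token_pattern.findall(text)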
Hi @sobayed ,
Thanks for the example, that was helpful! As @sebpuetz mentioned, you are actually comparing two very different algorithms.
The sklearn example seems to be doing roughly whitespace splitting with some normalization.
huggingface runs a BPE encoding algorithm.
The two are vastly different: the first one will either yield quite a few "unk" tokens or require a huge vocabulary (which means machine learning models will be huge). BPE, on the other hand, was designed so that unknown words are split into parts that should make sense for the language.
Here is an example:
# 'supercool' is not in the train data
vec.vocabulary_['supercool']
# KeyError: 'supercool'
# On the other hand the BPE algorithm manages to split this word into parts
tokenizer.encode('supercool').tokens
# ['▁super', 'c', 'ool']
I hope that explains the observed differences. That does not prevent us from finding more optimisations in the future to bring that down even further.
@traboukos I hope your example falls into a similar category (or the debug-mode problem I mentioned), but it's hard to say without access to your tokenizer and data file, or ones that can reproduce the problem (for instance, the BPE algorithm is known to be notably slow on languages that don't use whitespace).
Hi @Narsil, many thanks for your explanations! I'm actually aware of the differences in the algorithms. My question was mainly whether the method I'm currently using is the fastest way to get the tokens using a trained tokenizer or whether there is a more efficient way.
If this is already the best that is possible at the moment, I'm fine with that.
Well, you can get some speedup if you use encode_batch instead of encode, since it can use parallelization:
tokenizer.encode_batch([text, text, ....])
But it depends on your use case: depending on your code, you might get a deadlock if you are already using threading in Python: #311
To actually get the speedup you need to set TOKENIZERS_PARALLELISM=1 when running your script (TOKENIZERS_PARALLELISM=1 python mycommand.py); you should see a warning otherwise.
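Put together, a minimal sketch of the batched path (with the environment variable set before Python starts, or early in the script) could be:

import os
os.environ.setdefault("TOKENIZERS_PARALLELISM", "1")  # exporting it in the shell is the more reliable option

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file('my-trained-tokenizer.json')  # placeholder path
texts = [text] * 10_000                                       # placeholder batch
encodings = tokenizer.encode_batch(texts)                     # parallelised across the batch in Rust
token_lists = [e.tokens for e in encodings]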
For my current use case, batch encoding is unfortunately not an option. Still good to know about it though for the future! Many thanks again for your help 👍