Tokenizer encode very slow
Hi All,
I have trained a tokenizer on my own dataset, which consists of files with 50,000 lines of about 5,000 tokens each. Training is fast: all cores are utilised and it finishes in around 30 minutes for my dataset. However, encoding is really slow, whether I encode single sentences or a whole batch. The following code finishes in 30 seconds for a file of 50,000 lines:
from tokenizers import Tokenizer

t = Tokenizer.from_file('my-trained-tokenizer.json')
t.enable_padding(length=512)
t.enable_truncation(max_length=512)

# opening the file and reading the lines into memory takes 250 ms
with open('my-file.txt') as f:
    lines = [f'[start]{l}' for l in f]

# encode_batch on the inputs takes 30 seconds for 50,000 lines
inputs = t.encode_batch(lines)

# looping and encoding one by one takes several minutes
for l in lines:
    ids = t.encode(l).ids
For context: opening the file, reading the lines into memory, looping in Python over every line, splitting on spaces, and replacing every token with an int via a dict lookup finishes in 6 seconds.
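For reference, a minimal sketch of that pure-Python baseline would look roughly like this (assuming a plain dict vocab mapping token strings to ids; the fallback id 0 is just a placeholder for unknown tokens):

# hypothetical pure-Python baseline: whitespace split + dict lookup per token
with open('my-file.txt') as f:
    ids_per_line = [[vocab.get(tok, 0) for tok in line.split(' ')] for line in f]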
Are the speeds mentioned above normal? Trying to tokenize on the fly with TensorFlow datasets is hopeless currently, since my GPUs sit at 2% utilisation. Do I need to save the dataset in tokenized form? That is also a costly process, since it needs to be performed daily for a lot of data.
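If it helps, the tokenized form I have in mind would be something like the sketch below (file names are placeholders, and numpy is just one possible storage format):

import numpy as np
from tokenizers import Tokenizer

t = Tokenizer.from_file('my-trained-tokenizer.json')
t.enable_padding(length=512)
t.enable_truncation(max_length=512)

with open('my-file.txt') as f:
    lines = [f'[start]{l}' for l in f]

# encode once, keep only the padded/truncated ids, and cache them on disk
ids = np.array([e.ids for e in t.encode_batch(lines)], dtype=np.int32)
np.save('my-file.tokenized.npy', ids)  # later: tf.data.Dataset.from_tensor_slices(np.load(...))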
Similar observation here. I trained a tokenizer (Tokenizer(vocabulary_size=64000, model=SentencePieceBPE, unk_token=<unk>, replacement=▁, add_prefix_space=True, dropout=None)) on a dataset of 200k records. I want to use the tokenizer in a scikit-learn pipeline, so I'm only interested in the resulting tokens of a text:
def huggingface_tokenize(text, tokenizer):
    return tokenizer.encode(text).tokens
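For example, the intended use is to plug this callable straight into the vectorizer, roughly like this (a sketch; recent sklearn versions may warn that token_pattern is ignored when a custom tokenizer is passed):

from functools import partial
from sklearn.feature_extraction.text import CountVectorizer

# use the trained subword tokenizer in place of sklearn's default word tokenizer
vectorizer = CountVectorizer(
    tokenizer=partial(huggingface_tokenize, tokenizer=tokenizer),
    lowercase=False,  # the tokenizer handles its own normalization
)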
Compared to a standard sklearn CountVectorizer, the tokenization of an individual text is ~ 13x slower. See benchmark:
%timeit huggingface_tokenize(text, tokenizer)
>>> 25.6 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit sklearn_tokenize(text)
>>> 1.87 µs ± 158 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Is this to be expected or is there a way to speed this up?
Did you install tokenizers from source, with pip install -e .?
Currently installing that way will work, but Rust is built in debug mode rather than release mode, which makes a huge difference. If you want to fix it right now, you just need to change
rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3)]
into
rust_extensions=[RustExtension("tokenizers.tokenizers", binding=Binding.PyO3, debug=False)],
We'll be updating that soon so that we can't shoot ourselves in the foot in the future.
Otherwise, could either of you share the files and code necessary to reproduce? I'd be happy to find the bottlenecks and maybe fix them. That would be a huge help, thanks.
I installed tokenizers from PyPI, not from source.
Unfortunately, I cannot share the data but here's a slightly different reproducible example:
import pandas as pd
from tokenizers import SentencePieceBPETokenizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
data = fetch_20newsgroups()
vec = CountVectorizer()
vec.fit(data.data)
text = data.data[0]
sklearn_tokenize = vec.build_tokenizer()
%timeit sklearn_tokenize(text)
>>> 40.5 µs ± 454 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
data_path = 'tokenize_benchmark.txt'
pd.Series(data.data).to_csv(data_path, header=False, index=False)
tokenizer = SentencePieceBPETokenizer()
tokenizer.train([data_path], vocab_size=16_000, min_frequency=2, limit_alphabet=1000)
def huggingface_tokenize(text, tokenizer):
    return tokenizer.encode(text).tokens
%timeit huggingface_tokenize(text, tokenizer)
>>> 228 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I've been lurking on this through mail notifications. I think part of the difference might be due to accessing the tokens attribute on the encoded tokens. This copies all tokens twice:
https://github.com/huggingface/tokenizers/blob/62c3d40f1129e055438fb81c118797182560c72b/bindings/python/src/encoding.rs#L89-L92
First, to_vec() allocates a new String for each token in the Encoding and allocates a new Vec with at least n_tokens capacity. Then a PyList is allocated, and each of the tokens from the new Vec is first converted to a PyString and then pushed to the PyList. It might even be that the PyList grows a few times and gets re-allocated; I'm currently not sure about the implementation of the Vec -> PyList conversion.
There might actually be an implementation for &[String] -> &PyList which could save at least the intermediate step in Rust. I don't think the conversion to PyList can be optimized a lot, tho.
edit: The implementation exists, that might cut some inefficiency.
https://github.com/PyO3/pyo3/blob/a0960f891801c0534856cb90fa90451828579470/src/types/list.rs#L165-L179
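One quick way to check how much of the per-call cost is this conversion (assuming each attribute access re-runs it, which the getter above suggests) is to time the pieces separately:

enc = tokenizer.encode(text)    # encode once

%timeit tokenizer.encode(text)  # cost of the BPE encoding itself
%timeit enc.tokens              # cost of converting the token strings to a Python list
%timeit enc.ids                 # ids are plain ints, so this conversion should be cheaper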
The other question is: what does sklearn_tokenize() do? Is it just whitespace tokenizing? If so, then you're comparing very different methods and algorithms, and it wouldn't be surprising at all that it's much faster.
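(For reference, if sklearn_tokenize comes from CountVectorizer's default build_tokenizer(), it is essentially a single regex findall, roughly:)

import re

# CountVectorizer's default token_pattern: runs of 2+ word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def sklearn_like_tokenize(text):
    # no vocabulary lookup, no subword merges, just regex word splitting
    return token_pattern.findall(text)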
Hi @sobayed ,
Thanks for the example, that was helpful! As @sebpuetz mentioned, you are actually comparing two very different algorithms.
The sklearn example seems to be doing roughly whitespace splitting with some normalization.
huggingface runs a BPE encoding algorithm.
The two are vastly different: the first one will either yield quite a few "unk" tokens or require a huge vocabulary (which means machine learning models will be huge). BPE, on the other hand, was designed so that unknown words are split into parts that should make sense for the language.
Here is an example:
# 'supercool' is not in the train data
vec.vocabulary_['supercool']
# KeyError: 'supercool'
# On the other hand the BPE algorithm manages to split this word into parts
tokenizer.encode('supercool').tokens
# ['▁super', 'c', 'ool']
I hope that explains the observed differences. That does not prevent us from finding more optimisations in the future to bring that down even further.
@traboukos I hope your example falls into a similar category (or the debug-mode problem I mentioned), but it's hard to say without access to your tokenizer and data file, or ones that can reproduce the problem (for instance, the BPE algorithm is known to be notably slow on languages that don't use whitespace).
Hi @Narsil, many thanks for your explanations! I'm actually aware of the differences in the algorithms. My question was mainly whether the method I'm currently using is the fastest way to get the tokens using a trained tokenizer or whether there is a more efficient way.
If this is already the best that is possible at the moment, I'm fine with that.
Well, you can get some speedup if you use encode_batch instead of encode, since it can use parallelization:
tokenizer.encode_batch([text, text, ....])
But it depends on your use case: depending on your code, you might get a deadlock if you are already using threading in Python: #311
To actually get the speedup you need to set TOKENIZERS_PARALLELISM=1 when running your script (TOKENIZERS_PARALLELISM=1 python mycommand.py); you should see a warning otherwise.
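Put together, a minimal sketch of the batched path (with the environment variable set before Python starts, or early in the script) could be:

import os
os.environ.setdefault("TOKENIZERS_PARALLELISM", "1")  # exporting it in the shell is the more reliable option

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file('my-trained-tokenizer.json')  # placeholder path
texts = [text] * 10_000                                       # placeholder batch
encodings = tokenizer.encode_batch(texts)                     # parallelised across the batch in Rust
token_lists = [e.tokens for e in encodings]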
For my current use case, batch encoding is unfortunately not an option. Still good to know about it though for the future! Many thanks again for your help 👍