Do `batch_tokenize` in `PretrainedTransformerTokenizer`
Given that the transformers library now ships fast tokenizers that are likely even faster when used on batches, I think we can implement `batch_tokenize` in `PretrainedTransformerTokenizer` so that it calls `batch_encode_plus`.
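Roughly something like this sketch. It assumes the wrapped HuggingFace tokenizer is available as `self.tokenizer` and that `Token(text=..., text_id=...)` matches the current `Token` fields; the subclass name is just for illustration:

```python
from typing import List

from allennlp.data.tokenizers import PretrainedTransformerTokenizer, Token


class BatchTokenizingTransformerTokenizer(PretrainedTransformerTokenizer):
    """Hypothetical subclass sketching the proposed override."""

    def batch_tokenize(self, texts: List[str]) -> List[List[Token]]:
        # One call for the whole batch, so a fast (Rust) tokenizer can
        # parallelize across examples instead of encoding them one by one.
        batch = self.tokenizer.batch_encode_plus(texts, add_special_tokens=True)
        return [
            [
                Token(text=t, text_id=i)
                for t, i in zip(self.tokenizer.convert_ids_to_tokens(ids), ids)
            ]
            for ids in batch["input_ids"]
        ]
```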
Sounds good to me, PR welcome. Though does this have to wait until we update our dependency on transformers?
It doesn't, as the interface is the same.
I'm not convinced that the new tokenizer will realize speed gains in batches, but it's quick to test. I'd want to make sure it's worth it before spending time on this.
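For reference, a throwaway micro-benchmark along these lines should settle it (the model name, `use_fast=True`, and the corpus size are all arbitrary choices):

```python
import time

from transformers import AutoTokenizer

# Compare per-example encoding against a single batched call.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
texts = ["The quick brown fox jumps over the lazy dog."] * 10_000

start = time.perf_counter()
for text in texts:
    tokenizer.encode_plus(text)
print(f"loop:  {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
tokenizer.batch_encode_plus(texts)
print(f"batch: {time.perf_counter() - start:.2f}s")
```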
It seems so, judging from the code :man_shrugging:
https://github.com/huggingface/tokenizers/blob/11dd6c8baef9ae2b836d594215f14a208dbacfb2/tokenizers/src/tokenizer/mod.rs#L364
Uhh, multithreaded tokenization. I am mindful of Amdahl's Law, but I also think this is probably worth it then, at least if it comes with no API change.
It could have a big impact if 1) your whole dataset fits in memory (you can also send it in chunks, as sketched below) and 2) you tokenize everything at once.
(and you have many cores...)
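A minimal sketch of the chunking idea, assuming a plain iterable of strings and any HuggingFace tokenizer (the helper name and chunk size are made up):

```python
from typing import Iterable, Iterator, List


def chunks(texts: Iterable[str], size: int = 1000) -> Iterator[List[str]]:
    """Yield fixed-size chunks so the full dataset never sits in memory."""
    chunk: List[str] = []
    for text in texts:
        chunk.append(text)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


# Each chunk is still a batch, so the tokenizer can parallelize within it:
# for chunk in chunks(corpus):
#     encodings = tokenizer.batch_encode_plus(chunk)
```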