
Do `batch_tokenize` in `PretrainedTransformerTokenizer`

Open bryant1410 opened this issue 5 years ago • 7 comments

Given that the transformers library now includes fast tokenizers that are probably faster when run on batches, I think we could implement `batch_tokenize` in `PretrainedTransformerTokenizer` so that it calls `batch_encode_plus`.
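Roughly something like this (just a sketch, not tested; it assumes the wrapped huggingface tokenizer lives at `self.tokenizer`, that a `Token` only needs `text` and `text_id`, and the import path may differ by version):

```python
from typing import List

from allennlp.data.tokenizers import Token  # exact import path depends on the allennlp version


def batch_tokenize(self, texts: List[str]) -> List[List[Token]]:
    """Tokenize a batch of texts with a single call into the huggingface tokenizer."""
    # batch_encode_plus adds special tokens and lets fast (Rust) tokenizers
    # process the whole batch at once instead of one text at a time.
    encoded = self.tokenizer.batch_encode_plus(texts, add_special_tokens=True)
    return [
        [
            Token(text=self.tokenizer.convert_ids_to_tokens(token_id), text_id=token_id)
            for token_id in ids
        ]
        for ids in encoded["input_ids"]
    ]
```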

bryant1410 avatar Feb 20 '20 18:02 bryant1410

Sounds good to me, PR welcome. Though does this have to wait until we update our dependency on transformers?

matt-gardner avatar Feb 20 '20 18:02 matt-gardner

It doesn't, as the interface is the same.

bryant1410 avatar Feb 20 '20 18:02 bryant1410

I'm not convinced that the new tokenizer will realize speed gains in batches, but it's quick to test. I'd want to make sure it's worth it before spending time on this.
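A quick check could look something like this (rough sketch; the model name, the repeated sentence, and the `use_fast` flag are just placeholders for whatever setup is convenient, and `use_fast` assumes a recent enough transformers):

```python
import time

from transformers import AutoTokenizer

# Hypothetical setup: any fast tokenizer and a reasonably large list of sentences.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
sentences = ["AllenNLP is a library built on PyTorch ."] * 10_000

# Tokenize one sentence at a time.
start = time.perf_counter()
for sentence in sentences:
    tokenizer.encode_plus(sentence, add_special_tokens=True)
one_by_one = time.perf_counter() - start

# Tokenize the whole list in one batched call.
start = time.perf_counter()
tokenizer.batch_encode_plus(sentences, add_special_tokens=True)
batched = time.perf_counter() - start

print(f"one by one: {one_by_one:.2f}s, batched: {batched:.2f}s")
```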

dirkgr avatar Feb 21 '20 19:02 dirkgr

It seems so, judging from the code :man_shrugging:

https://github.com/huggingface/tokenizers/blob/11dd6c8baef9ae2b836d594215f14a208dbacfb2/tokenizers/src/tokenizer/mod.rs#L364

bryant1410 avatar Feb 21 '20 19:02 bryant1410

Uhh, multithreaded tokenization. I am mindful of Amdahl's Law, but I also think this is probably worth it then, at least if it comes with no API change.

dirkgr avatar Feb 21 '20 20:02 dirkgr

It could have a big impact if 1) your whole dataset fits in memory (you can also send it in chunks) and 2) you tokenize everything at once.
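For the chunked case, something like this hypothetical helper (not part of AllenNLP) would keep memory bounded while still batching the calls:

```python
from typing import Iterable, Iterator, List


def tokenize_in_chunks(tokenizer, texts: Iterable[str], chunk_size: int = 1000) -> Iterator[List[List[int]]]:
    """Yield batch_encode_plus input_ids chunk by chunk instead of loading everything at once."""
    chunk: List[str] = []
    for text in texts:
        chunk.append(text)
        if len(chunk) == chunk_size:
            yield tokenizer.batch_encode_plus(chunk, add_special_tokens=True)["input_ids"]
            chunk = []
    if chunk:  # flush the final partial chunk
        yield tokenizer.batch_encode_plus(chunk, add_special_tokens=True)["input_ids"]
```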

bryant1410 avatar Feb 21 '20 20:02 bryant1410

(and you have many cores...)

bryant1410 avatar Feb 21 '20 20:02 bryant1410