
Do `batch_tokenize` in `PretrainedTransformerTokenizer`

Open bryant1410 opened this issue 5 years ago • 7 comments

Given that the transformers library now includes fast tokenizers that are probably faster when run on batches, I think we could implement `batch_tokenize` in `PretrainedTransformerTokenizer` so that it calls `batch_encode_plus`.
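Roughly something like this (just a sketch, not tested; it assumes the wrapped huggingface tokenizer lives at `self.tokenizer`, that a `Token` only needs `text` and `text_id`, and the import path may differ by version):

```python
from typing import List

from allennlp.data.tokenizers import Token  # exact import path depends on the allennlp version


def batch_tokenize(self, texts: List[str]) -> List[List[Token]]:
    """Tokenize a batch of texts with a single call into the huggingface tokenizer."""
    # batch_encode_plus adds special tokens and lets fast (Rust) tokenizers
    # process the whole batch at once instead of one text at a time.
    encoded = self.tokenizer.batch_encode_plus(texts, add_special_tokens=True)
    return [
        [
            Token(text=self.tokenizer.convert_ids_to_tokens(token_id), text_id=token_id)
            for token_id in ids
        ]
        for ids in encoded["input_ids"]
    ]
```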

bryant1410 avatar Feb 20 '20 18:02 bryant1410

Sounds good to me, PR welcome. Though does this have to wait until we update our dependency on transformers?

matt-gardner avatar Feb 20 '20 18:02 matt-gardner

It doesn't, as the interface is the same.

bryant1410 avatar Feb 20 '20 18:02 bryant1410

I'm not convinced that the new tokenizer will realize speed gains in batches, but it's quick to test. I'd want to make sure it's worth it before spending time on this.
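A quick check could look something like this (rough sketch; the model name, the repeated sentence, and the `use_fast` flag are just placeholders for whatever setup is convenient, and `use_fast` assumes a recent enough transformers):

```python
import time

from transformers import AutoTokenizer

# Hypothetical setup: any fast tokenizer and a reasonably large list of sentences.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
sentences = ["AllenNLP is a library built on PyTorch ."] * 10_000

# Tokenize one sentence at a time.
start = time.perf_counter()
for sentence in sentences:
    tokenizer.encode_plus(sentence, add_special_tokens=True)
one_by_one = time.perf_counter() - start

# Tokenize the whole list in one batched call.
start = time.perf_counter()
tokenizer.batch_encode_plus(sentences, add_special_tokens=True)
batched = time.perf_counter() - start

print(f"one by one: {one_by_one:.2f}s, batched: {batched:.2f}s")
```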

dirkgr avatar Feb 21 '20 19:02 dirkgr

It seems so, judging from the code :man_shrugging:

https://github.com/huggingface/tokenizers/blob/11dd6c8baef9ae2b836d594215f14a208dbacfb2/tokenizers/src/tokenizer/mod.rs#L364

bryant1410 avatar Feb 21 '20 19:02 bryant1410

Uhh, multithreaded tokenization. I am mindful of Amdahl's Law, but I also think this is probably worth it then, at least if it comes with no API change.

dirkgr avatar Feb 21 '20 20:02 dirkgr

It could have a big impact if 1) your whole dataset fits in memory (you can also send it in chunks) and 2) you tokenize everything at once.
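For the chunked case, something like this hypothetical helper (not part of AllenNLP) would keep memory bounded while still batching the calls:

```python
from typing import Iterable, Iterator, List


def tokenize_in_chunks(tokenizer, texts: Iterable[str], chunk_size: int = 1000) -> Iterator[List[List[int]]]:
    """Yield batch_encode_plus input_ids chunk by chunk instead of loading everything at once."""
    chunk: List[str] = []
    for text in texts:
        chunk.append(text)
        if len(chunk) == chunk_size:
            yield tokenizer.batch_encode_plus(chunk, add_special_tokens=True)["input_ids"]
            chunk = []
    if chunk:  # flush the final partial chunk
        yield tokenizer.batch_encode_plus(chunk, add_special_tokens=True)["input_ids"]
```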

bryant1410 avatar Feb 21 '20 20:02 bryant1410

(and you have many cores...)

bryant1410 avatar Feb 21 '20 20:02 bryant1410