Fast WordPiece Tokenization
Hello everyone, I want to implement a Fast WordPiece Tokenization algorithm introduced by Google.
Fast WordPiece algorithm
Google introduced a new algorithm for WordPiece tokenization called LinMaxMatch, which has O(n) time complexity. I realized that PyTorch doesn't support it yet, so I want to implement it. This could be especially useful for mobile platforms.
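For context, below is a minimal sketch of the classic greedy longest-match WordPiece algorithm, which is the worst-case quadratic baseline that LinMaxMatch reduces to O(n) by precompiling the vocabulary into an Aho-Corasick-style trie with failure links. The vocabulary and function name here are made up for illustration; this is not the LinMaxMatch implementation itself.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match WordPiece: at each position, take the longest
    vocabulary piece, prefixing non-initial pieces with '##'."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a hit.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk]
        tokens.append(match)
        start = end
    return tokens

# Toy vocabulary, purely for demonstration.
vocab = {"un", "##aff", "##able"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

The inner shrinking loop is what makes this quadratic in the word length; LinMaxMatch avoids re-scanning by following precomputed failure links instead.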
Implementation
I'm using C++, and I have read the contributing documentation of PyTorch and the source code of TORCHTEXT.DATA.UTILS, so I'm considering two options for this feature: first, implementing it in the c10 folder; second, implementing it as a Python package. I'd like to know which way is better and more appropriate for the PyTorch project.
Thank you for any advice and suggestions.
I don't see a label for torchtext. @ejguan what label do you think this should go under?
@bdhirsh We should transfer this issue to TorchText repo.
Hi @minhnhat10, thank you for your proposal. This would certainly be a welcome contribution to torchtext repo :).
Currently, we have sentencepiece and byte-level BPE (used in GPT-2) implemented and bound to Python, which could act as a starting point to help with the code-base.
I would suggest looking at the sentencepiece wrapper and the corresponding registration mechanisms: using pybind and using torchbind.
cc: @abhinavarora
Hi @parmeet, thank you for your suggestion. I will try that.
Hey guys,
I know it's pretty late, but I have implemented a vanilla Python version of Fast WordPiece Tokenization and look forward to contributing it to TorchText with some help.
https://github.com/lucifermorningstar1305/fast_wordpiece_tokenization
Hey @lucifermorningstar1305. Thanks for reaching out here. Our implementation of the BERT Tokenizer is written in C++ and can be found within these 2 files:
- https://github.com/pytorch/text/blob/main/torchtext/csrc/bert_tokenizer.h
- https://github.com/pytorch/text/blob/main/torchtext/csrc/bert_tokenizer.cpp
You are welcome to contribute the Fast WordPiece Tokenizer in a separate file within that folder. You can find the contribution guidelines for C++ operators here. Before we migrate over to the new tokenizer, we would also want some simple benchmarks showcasing that the new tokenizer is indeed faster than the existing BERTTokenizer. I am happy to review any PRs you make and help with questions about contributing!
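For the benchmarking request above, a simple harness could time the two tokenizers on the same corpus with `timeit`. The tokenizer callables below are stand-ins, not the actual BERTTokenizer or Fast WordPiece APIs; in practice you would substitute the real tokenizer objects.

```python
import timeit

def benchmark(tokenize, corpus, number=100):
    """Return total seconds to tokenize every string in `corpus`,
    repeated `number` times."""
    return timeit.timeit(lambda: [tokenize(s) for s in corpus], number=number)

# Stand-in tokenizers for illustration only; replace with the existing
# BERTTokenizer and the new Fast WordPiece tokenizer when comparing.
baseline = lambda s: s.split()
candidate = lambda s: s.split()

corpus = ["hello world", "fast wordpiece tokenization"] * 100
t_base = benchmark(baseline, corpus)
t_cand = benchmark(candidate, corpus)
print(f"baseline: {t_base:.4f}s, candidate: {t_cand:.4f}s")
```

Reporting both wall-clock totals and the speedup ratio on a realistic corpus (e.g. a few thousand sentences) would make the comparison easy to review.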