Fast WordPiece Tokenization

Open minhnhat10 opened this issue 3 years ago • 6 comments

Hello everyone, I want to implement the Fast WordPiece Tokenization algorithm introduced by Google.

Fast WordPiece algorithm

Google introduced a new algorithm for WordPiece tokenization called LinMaxMatch, which has O(n) time complexity. I realized that PyTorch doesn't support it yet, so I want to implement it. This could be especially useful on mobile platforms.
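
For reference, here is a rough C++ sketch (names are illustrative, not from any existing codebase) of the classic greedy MaxMatch WordPiece that LinMaxMatch improves on: the baseline re-scans shrinking substrings at each position, which is O(n²) per word in the worst case, while LinMaxMatch produces the same output in O(n) by walking a trie with precomputed failure links.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Classic greedy MaxMatch WordPiece (the BERT baseline). At each position it
// tries the longest remaining substring first and shrinks until it finds a
// vocab hit, so a single word can cost O(n^2); LinMaxMatch yields the same
// segmentation in O(n).
std::vector<std::string> max_match_wordpiece(
        const std::string& word,
        const std::unordered_set<std::string>& vocab,
        const std::string& unk = "[UNK]") {
    std::vector<std::string> pieces;
    std::size_t start = 0;
    while (start < word.size()) {
        std::size_t end = word.size();
        std::string match;
        for (; end > start; --end) {
            std::string candidate = word.substr(start, end - start);
            if (start > 0) candidate = "##" + candidate;  // continuation marker
            if (vocab.count(candidate)) { match = candidate; break; }
        }
        if (match.empty()) return {unk};  // no piece fits: whole word -> [UNK]
        pieces.push_back(match);
        start = end;
    }
    return pieces;
}
```

For example, with a vocab containing "un", "##aff", and "##able", the word "unaffable" comes out as ["un", "##aff", "##able"].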

Implementation

I'm using C++. I have read PyTorch's contributing documentation and the source code of torchtext.data.utils, and I see two options for this feature: first, implementing it in the c10 folder; second, implementing it as a Python package. I want to know which way is better and more appropriate for the PyTorch project.

Thank you for any advice and suggestions.

minhnhat10 avatar Dec 18 '21 12:12 minhnhat10

I don't see a label for torchtext. @ejguan, what label do you think this should go under?

bdhirsh avatar Dec 20 '21 15:12 bdhirsh

> I don't see a label for torchtext. @ejguan, what label do you think this should go under?

@bdhirsh We should transfer this issue to TorchText repo.

ejguan avatar Dec 20 '21 15:12 ejguan

Hi @minhnhat10, thank you for your proposal. This would certainly be a welcome contribution to the torchtext repo :).

Currently, we have sentencepiece and byte-level BPE (used in GPT-2) implemented and bound to Python; these could act as a starting point for getting familiar with the code-base.

I would suggest looking at the sentencepiece wrapper and the corresponding registration mechanisms: using pybind and using torchbind.
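
For a sense of what the pybind route looks like, here is a minimal registration sketch; the FastWordPieceTokenizer class is a hypothetical stand-in for the new tokenizer, not an existing torchtext API.

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>  // converts std::vector<std::string> to/from list[str]

#include <string>
#include <vector>

namespace py = pybind11;

// Hypothetical stub; the real class would load the vocab file and run LinMaxMatch.
struct FastWordPieceTokenizer {
    explicit FastWordPieceTokenizer(const std::string& /*vocab_path*/) {}
    std::vector<std::string> tokenize(const std::string& text) const {
        return {text};  // placeholder output
    }
};

PYBIND11_MODULE(_fast_wordpiece, m) {
    py::class_<FastWordPieceTokenizer>(m, "FastWordPieceTokenizer")
        .def(py::init<const std::string&>(), py::arg("vocab_path"))
        .def("tokenize", &FastWordPieceTokenizer::tokenize, py::arg("text"),
             "Tokenize a string into WordPiece tokens.");
}
```

The torchbind route registers the same class via torch::class_ instead, which makes it usable from TorchScript.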

cc: @abhinavarora

parmeet avatar Dec 20 '21 15:12 parmeet

Hi @parmeet, thank you for your suggestion. I will try that.

minhnhat10 avatar Dec 21 '21 16:12 minhnhat10

Hey guys,

I know it's pretty late, but I have implemented a vanilla Python version of Fast WordPiece Tokenization and look forward to contributing it to TorchText with some help.

https://github.com/lucifermorningstar1305/fast_wordpiece_tokenization

lucifermorningstar1305 avatar Apr 11 '23 22:04 lucifermorningstar1305

Hey @lucifermorningstar1305. Thanks for reaching out here. Our implementation of the BERT Tokenizer is written in C++ and can be found within these 2 files:

  • https://github.com/pytorch/text/blob/main/torchtext/csrc/bert_tokenizer.h
  • https://github.com/pytorch/text/blob/main/torchtext/csrc/bert_tokenizer.cpp

You are welcome to contribute the Fast WordPiece Tokenizer in a separate file within that folder. You can find the contribution guidelines for C++ operators here. Before we migrate over to the new tokenizer, we would also want some simple benchmarks showcasing that the new tokenizer is indeed faster than the existing BERTTokenizer. I am happy to review any PRs you make and help with questions about contributing!
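
For the benchmarks, something along these lines would be enough (a rough std::chrono sketch; the tokenizer names in the usage comment are illustrative):

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Rough timing harness: Tokenizer is any type exposing
// tokenize(const std::string&) -> std::vector<std::string>.
template <typename Tokenizer>
double seconds_to_tokenize(Tokenizer& tok,
                           const std::vector<std::string>& corpus,
                           std::size_t& total_tokens) {
    total_tokens = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (const auto& line : corpus)
        total_tokens += tok.tokenize(line).size();  // keeps the call observable
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Usage sketch: run both tokenizers over the same corpus, verify the token
// sequences match, and report throughput for each, e.g.:
// std::size_t n_old, n_new;
// double t_old = seconds_to_tokenize(bert_tokenizer, corpus, n_old);
// double t_new = seconds_to_tokenize(fast_tokenizer, corpus, n_new);
// std::cout << n_old / t_old << " vs " << n_new / t_new << " tokens/sec\n";
```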

Nayef211 avatar Apr 12 '23 00:04 Nayef211