Add methods defined on SentencePieceProcessor
This PR adds most of the methods defined in the SentencePieceProcessor Python wrapper. ~~Blocked by https://github.com/pytorch/pytorch/pull/38167~~
- `NBestEncodeAsPieces`
- `NBestEncodeAsIds`
- `SampleEncodeAsPieces`
- `SampleEncodeAsIds`
- `DecodePieces`
- `DecodeIds`
- `GetPieceSize`
- `PieceToId`
- `IdToPiece`
- `GetScore`
- `IsUnknown`
- `IsUnused`
- `IsControl`
- `unk_id`
- `bos_id`
- `eos_id`
- `pad_id`
- `SetEncodeExtraOptions`
- `SetDecodeExtraOptions`
- `SetVocabulary`
- `ResetVocabulary`
- `LoadVocabulary`
- `__len__`
- `__getitem__`
Let us know when you need a review on this
I was wondering why we need to expose all those methods. Our users probably won't expect those methods from our side, and I don't see them in pytext.
@zhangguanheng66 Okay, I removed them. Once it passes the tests, it's ready to merge.
LGTM. thanks @mthrok !
Here are a few python wrappers:
- `encode_as_pieces` tokenizes a sentence into a list of tokens
- `encode_as_ids` tokenizes a sentence into a list of token ids. In general, this wrapper could be applied directly to convert a sentence into a tensor that can be sent to the model for training/inference
- `piece_to_id` is equivalent to `stoi` in the `torchtext.vocab` class
- `id_to_piece` is equivalent to `itos` in the `torchtext.vocab` class
- `decode_pieces` combines a list of tokens into the original sentence
- `decode_ids` combines a list of token ids into the original sentence

Will look at the other funcs on the cpp side.
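To make the relationship between these wrappers concrete, here is a minimal sketch using a toy whitespace tokenizer. `ToyProcessor` and its vocabulary are hypothetical stand-ins written for illustration only; this is not the SentencePiece model or the torchtext binding, and only the method names follow the list above.

```python
class ToyProcessor:
    """Hypothetical whitespace tokenizer; NOT the real SentencePiece model."""

    UNK = "<unk>"

    def __init__(self, vocab):
        # itos: id -> piece, stoi: piece -> id (names mirror torchtext.vocab).
        self.itos = [self.UNK] + [p for p in vocab if p != self.UNK]
        self.stoi = {piece: idx for idx, piece in enumerate(self.itos)}

    def encode_as_pieces(self, sentence):
        # Assumption: mark each word start with U+2581, as SentencePiece does.
        return ["\u2581" + word for word in sentence.split()]

    def piece_to_id(self, piece):
        # Unknown pieces fall back to the <unk> id (0 in this toy vocabulary).
        return self.stoi.get(piece, self.stoi[self.UNK])

    def id_to_piece(self, idx):
        return self.itos[idx]

    def encode_as_ids(self, sentence):
        return [self.piece_to_id(p) for p in self.encode_as_pieces(sentence)]

    def decode_pieces(self, pieces):
        # Join the pieces and turn the word-start markers back into spaces.
        return "".join(pieces).replace("\u2581", " ").strip()

    def decode_ids(self, ids):
        return self.decode_pieces([self.id_to_piece(i) for i in ids])


sp = ToyProcessor(["\u2581hello", "\u2581world"])
pieces = sp.encode_as_pieces("hello world")
ids = sp.encode_as_ids("hello world")
```

In this toy setup, `decode_pieces(pieces)` and `decode_ids(ids)` both round-trip back to the input sentence, which is the same property the simple tests for the real wrappers would be expected to check.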
@zhangguanheng66 I added simple tests for piece_to_id, id_to_piece, decode_pieces and decode_ids.
(encode_as_pieces and encode_as_ids were already tested.)
Hi @mthrok!
Thank you for your pull request.
We require contributors to sign our Contributor License Agreement, and yours needs attention.
You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!