Add methods defined on SentencePieceProcessor
This PR adds most of the methods defined in the SentencePieceProcessor Python wrapper. ~~Blocked by https://github.com/pytorch/pytorch/pull/38167~~
- `NBestEncodeAsPieces`
- `NBestEncodeAsIds`
- `SampleEncodeAsPieces`
- `SampleEncodeAsIds`
- `DecodePieces`
- `DecodeIds`
- `GetPieceSize`
- `PieceToId`
- `IdToPiece`
- `GetScore`
- `IsUnknown`
- `IsUnused`
- `IsControl`
- `unk_id`
- `bos_id`
- `eos_id`
- `pad_id`
- `SetEncodeExtraOptions`
- `SetDecodeExtraOptions`
- `SetVocabulary`
- `ResetVocabulary`
- `LoadVocabulary`
- `__len__`
- `__getitem__`
Let us know when you need a review on this
I was wondering why we need to expose all those methods. Our users probably won't expect those methods from our side, and I don't see them in pytext.
@zhangguanheng66 Okay, I removed them. Once it passes the tests, it's ready to merge.
LGTM. thanks @mthrok !
Here are a few Python wrappers:
- `encode_as_pieces` tokenizes a sentence into a list of tokens
- `encode_as_ids` tokenizes a sentence into a list of token ids. In general, this wrapper could be applied directly to convert a sentence into a tensor that can be sent to the model for training/inference
- `piece_to_id` is equivalent to `stoi` in the `torchtext.vocab` class
- `id_to_piece` is equivalent to `itos` in the `torchtext.vocab` class
- `decode_pieces` combines a list of tokens into the original sentence
- `decode_ids` combines a list of token ids into the original sentence
Will look at the other funcs on the cpp side.
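To make the relationship between the wrappers listed above concrete, here is a minimal, self-contained sketch. It is not the code from this PR and does not call SentencePiece or torchtext: `ToyProcessor`, its greedy longest-match segmentation, and the three-piece vocabulary are all hypothetical, chosen only to show how the six methods round-trip. The one convention borrowed from SentencePiece is the `▁` (U+2581) marker that stands in for a preceding space.

```python
from typing import Dict, List

WORD_START = "\u2581"  # "▁": SentencePiece's marker for a preceding space


class ToyProcessor:
    """Hypothetical stand-in for a SentencePiece-style processor (illustration only)."""

    def __init__(self, pieces: List[str]) -> None:
        # Assumption: id 0 is reserved for the unknown piece.
        self.itos: List[str] = ["<unk>"] + pieces
        self.stoi: Dict[str, int] = {p: i for i, p in enumerate(self.itos)}

    def encode_as_pieces(self, sentence: str) -> List[str]:
        # Greedy longest-match segmentation. A real model uses a unigram LM
        # or BPE instead; this is only a simple substitute.
        text = WORD_START + sentence.replace(" ", WORD_START)
        pieces: List[str] = []
        pos = 0
        while pos < len(text):
            for end in range(len(text), pos, -1):
                if text[pos:end] in self.stoi:
                    pieces.append(text[pos:end])
                    pos = end
                    break
            else:
                # No vocabulary entry starts here: emit the single character.
                pieces.append(text[pos])
                pos += 1
        return pieces

    def piece_to_id(self, piece: str) -> int:
        # Plays the role of `stoi`; unknown pieces map to id 0.
        return self.stoi.get(piece, 0)

    def id_to_piece(self, index: int) -> str:
        # Plays the role of `itos`.
        return self.itos[index]

    def encode_as_ids(self, sentence: str) -> List[int]:
        return [self.piece_to_id(p) for p in self.encode_as_pieces(sentence)]

    def decode_pieces(self, pieces: List[str]) -> str:
        text = "".join(pieces).replace(WORD_START, " ")
        # Drop the single leading space introduced by the first marker.
        return text[1:] if text.startswith(" ") else text

    def decode_ids(self, ids: List[int]) -> str:
        return self.decode_pieces([self.id_to_piece(i) for i in ids])


sp = ToyProcessor([WORD_START + "hello", WORD_START + "wor", "ld"])
pieces = sp.encode_as_pieces("hello world")
ids = sp.encode_as_ids("hello world")
assert sp.decode_pieces(pieces) == "hello world"
assert sp.decode_ids(ids) == "hello world"
```

With this toy vocabulary, `pieces` is `['▁hello', '▁wor', 'ld']` and `ids` is `[1, 2, 3]`. Decoding only recovers the input when every piece is in the vocabulary; out-of-vocabulary characters collapse to id 0 and decode as `<unk>`.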
@zhangguanheng66 I added simple tests for `piece_to_id`, `id_to_piece`, `decode_pieces` and `decode_ids`. (`encode_as_pieces` and `encode_as_ids` were already tested.)
Hi @mthrok!
Thank you for your pull request.
We require contributors to sign our Contributor License Agreement, and yours needs attention.
You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA signed`. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!