
Add methods defined on SentencePieceProcessor

Open mthrok opened this issue 4 years ago • 5 comments

This PR adds most of the methods defined in the SentencePieceProcessor Python wrapper. ~~Blocked by https://github.com/pytorch/pytorch/pull/38167~~

  • NBestEncodeAsPieces
  • NBestEncodeAsIds
  • SampleEncodeAsPieces
  • SampleEncodeAsIds
  • DecodePieces
  • DecodeIds
  • GetPieceSize
  • PieceToId
  • IdToPiece
  • GetScore
  • IsUnknown
  • IsUnused
  • IsControl
  • unk_id
  • bos_id
  • eos_id
  • pad_id
  • SetEncodeExtraOptions
  • SetDecodeExtraOptions
  • SetVocabulary
  • ResetVocabulary
  • LoadVocabulary
  • __len__
  • __getitem__

mthrok avatar May 09 '20 03:05 mthrok

Let us know when you need a review on this

cpuhrsch avatar May 21 '20 20:05 cpuhrsch

I was wondering why we need to expose all of those methods. Our users probably won't expect them from our side, and I don't see them in pytext.

@zhangguanheng66 Okay, I removed them. Once the tests pass, it's ready to merge.

mthrok avatar May 26 '20 19:05 mthrok

LGTM. thanks @mthrok !

hudeven avatar Jun 08 '20 20:06 hudeven

Here are a few Python wrappers:

  • encode_as_pieces tokenizes a sentence into a list of tokens
  • encode_as_ids tokenizes a sentence into a list of token ids. In general, this wrapper can be applied directly to convert a sentence into a tensor that can be sent to the model for training/inference
  • piece_to_id is equivalent to stoi in torchtext.vocab class
  • id_to_piece is equivalent to itos in torchtext.vocab class
  • decode_pieces combines a list of tokens into the original sentence
  • decode_ids combines a list of token ids into the original sentence
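
To make the relationship between these six wrappers concrete, here is a rough sketch. It is a toy stand-in, not SentencePiece and not the code in this PR: `ToyPieceProcessor` and its three-entry vocabulary are hypothetical, and it splits on whitespace instead of using a trained subword model. Only the method names and the "▁" word-start marker follow the real library.

```python
# Toy stand-in for illustration only: NOT the real SentencePiece model.
# It splits on whitespace and prefixes each word with the U+2581 marker,
# mimicking the wrapper names listed above over a tiny hard-coded vocabulary.

MARK = "\u2581"  # the word-start marker that SentencePiece pieces carry


class ToyPieceProcessor:  # hypothetical class name
    def __init__(self, vocab):
        self._itos = list(vocab)  # id -> piece
        self._stoi = {p: i for i, p in enumerate(self._itos)}  # piece -> id
        self._unk_id = self._stoi["<unk>"]

    def encode_as_pieces(self, sentence):
        # sentence -> list of string tokens
        return [MARK + word for word in sentence.split()]

    def piece_to_id(self, piece):
        # plays the role of stoi; unknown pieces map to the <unk> id
        return self._stoi.get(piece, self._unk_id)

    def id_to_piece(self, idx):
        # plays the role of itos
        return self._itos[idx]

    def encode_as_ids(self, sentence):
        # sentence -> list of integer token ids
        return [self.piece_to_id(p) for p in self.encode_as_pieces(sentence)]

    def decode_pieces(self, pieces):
        # list of string tokens -> sentence
        return "".join(pieces).replace(MARK, " ").strip()

    def decode_ids(self, ids):
        # list of integer token ids -> sentence
        return self.decode_pieces([self.id_to_piece(i) for i in ids])


sp = ToyPieceProcessor(["<unk>", MARK + "hello", MARK + "world"])
pieces = sp.encode_as_pieces("hello world")
ids = sp.encode_as_ids("hello world")
print(pieces)
print(ids)
print(sp.decode_ids(ids))
```

The list returned by encode_as_ids is the one that would be wrapped in a tensor before being passed to a model; decode_ids and decode_pieces invert the two encoders as long as every piece is in the vocabulary.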

Will look at other funcs on the cpp side.

zhangguanheng66 avatar Jun 09 '20 16:06 zhangguanheng66

@zhangguanheng66 I added simple tests for piece_to_id, id_to_piece, decode_pieces and decode_ids. (encode_as_pieces and encode_as_ids were already tested.)

mthrok avatar Jun 09 '20 18:06 mthrok

Hi @mthrok!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Sep 22 '23 01:09 facebook-github-bot