`SentencePieceTokenizer` should return IDs _or_ pieces

Open erip opened this issue 3 years ago • 3 comments

🚀 Feature

SentencePieceTokenizer (or similar) should return IDs.

Motivation

Currently, using SentencePieceTokenizer as a transform requires a second transform built on torchtext's Vocab abstraction to convert the encoded pieces into IDs. While Vocab is useful in its own right, SentencePieceTokenizer could realistically return IDs directly, subsuming the role of the Vocab transform.
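To make the two-step pipeline concrete, here is a toy sketch of the current pattern: one transform produces subword pieces, and a separate vocab transform maps pieces to IDs. All names here are illustrative stand-ins, not the actual torchtext or sentencepiece API; a real SentencePieceTokenizer would derive pieces from a trained `.model` file.

```python
# Toy stand-in for the current two-transform pipeline (illustrative only).

def sentencepiece_tokenize(text):
    """Hypothetical tokenizer transform: text -> subword pieces.

    Real SentencePiece marks word boundaries with the "▁" meta symbol;
    we fake that here with a naive whitespace split.
    """
    return ["▁" + word for word in text.split()]

def vocab_transform(pieces, vocab, unk_id=0):
    """Hypothetical Vocab transform: pieces -> IDs (the extra step
    this issue proposes folding into the tokenizer itself)."""
    return [vocab.get(piece, unk_id) for piece in pieces]

vocab = {"▁hello": 1, "▁world": 2}
pieces = sentencepiece_tokenize("hello world")
ids = vocab_transform(pieces, vocab)
print(pieces)  # ['▁hello', '▁world']
print(ids)     # [1, 2]
```

Every text example has to pass through both callables, even though a pretrained sentencepiece model already knows the piece-to-ID mapping internally.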

Pitch

Fewer transforms means fewer CPU cycles. :-) Additionally, this requires a user to create a Vocab object which may be unnecessary (or needlessly difficult) if they have a pretrained sentencepiece model.

Alternatives

Leave as is.

Additional context

erip avatar Jun 12 '22 16:06 erip

Thanks @erip for the issue. I think this makes sense. We do have an experimental version of it that returns IDs. In other tokenizer transforms (e.g. the BERT, CLIP, and GPT2BPE tokenizers), we now provide an option to return either IDs or tokens. I think we should do the same for SentencePiece as well. Let me put up a PR to take care of it soon.
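The option described above can be sketched as a single transform with a `return_tokens`-style flag, following the pattern the other tokenizer transforms use. This is a minimal illustration only; the class and parameter names are hypothetical, not the actual torchtext API.

```python
# Sketch of a single-transform tokenizer with a return_tokens flag
# (names are illustrative, not the real torchtext API).

class ToySentencePieceTokenizer:
    """Returns subword pieces when return_tokens=True, else IDs."""

    def __init__(self, vocab, return_tokens=False, unk_id=0):
        self.vocab = vocab
        self.return_tokens = return_tokens
        self.unk_id = unk_id

    def __call__(self, text):
        # Fake piece segmentation; a real implementation would call
        # into the trained sentencepiece model.
        pieces = ["▁" + word for word in text.split()]
        if self.return_tokens:
            return pieces
        return [self.vocab.get(p, self.unk_id) for p in pieces]

vocab = {"▁fewer": 3, "▁transforms": 4}
print(ToySentencePieceTokenizer(vocab, return_tokens=True)("fewer transforms"))
# ['▁fewer', '▁transforms']
print(ToySentencePieceTokenizer(vocab)("fewer transforms"))
# [3, 4]
```

With this shape, the Vocab transform becomes optional for users who only need IDs from a pretrained model.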

parmeet avatar Jun 13 '22 14:06 parmeet

@parmeet what is the status of this feature?

joecummings avatar Aug 31 '22 14:08 joecummings

> @parmeet what is the status of this feature?

@joecummings I don't think this feature was ever started. If you'd like, you could take this on as it should be relatively straightforward to implement!

Nayef211 avatar Aug 31 '22 15:08 Nayef211