`SentencePieceTokenizer` should return IDs _or_ pieces
🚀 Feature
SentencePieceTokenizer (or similar) should return IDs.
Motivation
Currently, using SentencePieceTokenizer as a transform requires a second transform built on torchtext's Vocab abstraction in order to convert the encoded pieces into IDs. While Vocab is useful in its own right, SentencePieceTokenizer could realistically return IDs directly, subsuming the function of a Vocab transform.
Pitch
Fewer transforms means fewer CPU cycles. :-) Additionally, the current approach requires the user to create a Vocab object, which may be unnecessary (or needlessly difficult) if they have a pretrained sentencepiece model.
Alternatives
Leave as is.
Additional context
Thanks @erip for the issue. I think this makes sense. We do have an experimental version of it that returns IDs. In other tokenizer transforms (e.g. the BERT, CLIP, and GPT2BPE tokenizers), we now provide an option to return either IDs or tokens. I think we should do the same for SentencePiece as well. Let me put up a PR to take care of it soon.
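For reference, the option described above could look roughly like the sketch below. This is a hypothetical illustration, not torchtext's actual implementation; the class name `SentencePieceTokenizerSketch`, the `return_ids` flag, and the toy piece-to-ID table are all made up for the example (a real transform would delegate segmentation and ID lookup to the loaded sentencepiece model, whose Python API already supports returning either pieces or IDs via `encode`'s `out_type` argument).

```python
class SentencePieceTokenizerSketch:
    """Hypothetical transform returning either subword pieces or their IDs."""

    def __init__(self, pieces, return_ids=False):
        # Toy stand-in for the piece table inside a pretrained
        # sentencepiece model; real IDs come from the model file.
        self._piece_to_id = {p: i for i, p in enumerate(pieces)}
        self.return_ids = return_ids

    def __call__(self, pieces):
        # In the real transform, `pieces` would be produced by segmenting
        # raw text with the underlying sentencepiece model.
        if self.return_ids:
            return [self._piece_to_id[p] for p in pieces]
        return list(pieces)
```

With this shape, a user with a pretrained model gets IDs in a single transform, with no separate Vocab step:

```python
tok = SentencePieceTokenizerSketch(["▁he", "llo"], return_ids=True)
tok(["▁he", "llo"])  # [0, 1]
```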
@parmeet what is the status of this feature?
@joecummings I don't think this feature was ever started. If you'd like, you could take this on, as it should be relatively straightforward to implement!