keras-nlp
Support tokenization of special tokens for word_piece_tokenizer
This PR enables the user to tokenize special tokens from text input, as suggested in keras-team/keras-nlp#1395.
The behaviour is the same as `BytePairTokenizer`, I think: the models' tokenizers that use `WordPieceTokenizer` now behave the same way as the models' tokenizers that use `BytePairTokenizer`.
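For example, something along these lines (a sketch only; the exact argument names evolve over the course of this thread and may differ in the merged API):

```python
import keras_nlp

vocab = ["[UNK]", "[CLS]", "[SEP]", "the", "qu", "##ick", "fox"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    # Assumption: `special_tokens` marks tokens that must survive intact.
    special_tokens=["[CLS]", "[SEP]"],
    # Name settled on later in this thread.
    special_tokens_in_strings=True,
)

# "[CLS]" and "[SEP]" come out as single tokens instead of being split
# on punctuation and mapped to "[UNK]".
print(tokenizer("[CLS] the quick fox [SEP]"))
```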
Thanks! This is a great feature.
I will think more on this next week. I like the overall approach of doing this on top of tf-text instead of inside the tf-text ops. But we may want to think about this more generally. Something like this as an end state...
- All tokenizers have a `special_tokens_in_strings=False` argument. If `True`, input strings can contain special tokens, which will be handled correctly.
- For pretrained tokenizers, this is set up automatically to respect the model's special tokens.
- For a base tokenizer, a special token list can be provided by the user.
- We eventually do this for all subword tokenizers: sentencepiece, byte pair, and word piece.
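A sketch of that end state for a pretrained tokenizer (hypothetical; assumes the flag is plumbed through `from_preset`):

```python
import keras_nlp

# The preset already knows the model's special tokens ("[CLS]", "[SEP]",
# ...), so the user only opts into handling them in raw input strings.
tokenizer = keras_nlp.models.BertTokenizer.from_preset(
    "bert_base_en_uncased",
    special_tokens_in_strings=True,
)
```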
Main question here for me... Can we guarantee for all tokenizer types that this will work without messing with the "black box" tf-text op? How?
@abuelnasr0 let me know what you think! Once we have an overall plan, landing this incrementally (e.g. starting with word piece as you are here) sounds good!
@mattdangerw
> All tokenizers have a `special_tokens_in_strings=False` argument. If `True`, input strings can contain special tokens, which will be handled correctly.
Do you think `special_tokens_in_strings` would be better as a call argument to the `tokenize` method, or as an initialization argument?
> Main question here for me... Can we guarantee for all tokenizer types that this will work without messing with the "black box" tf-text op? How?
For byte pair and word piece (with this PR), I think that is guaranteed. For sentencepiece, I am currently writing a Gist that will clarify how I would add the feature there; it guarantees that tf-text will not be touched. I will send the link here after I finish it, maybe tomorrow.
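The underlying idea, in both cases, is to carve the input on the special tokens before the tf-text op ever sees it. A minimal pure-Python sketch of that pre-splitting step (illustrative only, not the PR's actual code):

```python
import re

special_tokens = ["[CLS]", "[SEP]"]

def split_on_special_tokens(text):
    # Split the input into special-token and plain-text chunks. Special
    # tokens map straight to their ids; only the plain-text chunks are
    # handed to the underlying (untouched) tf-text tokenizer op.
    pattern = "|".join(re.escape(t) for t in special_tokens)
    return [chunk for chunk in re.split(f"({pattern})", text) if chunk]

print(split_on_special_tokens("[CLS] hello world [SEP]"))
# ['[CLS]', ' hello world ', '[SEP]']
```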
@mattdangerw check out these two PRs: keras-team/keras-nlp#1445 (support for sentencepiece) and keras-team/keras-nlp#1447 (which fixes a small nit in the tokenizer, but we can use the same PR to make the behaviour similar for all).
For Sentencepiece, I named the argument `special_tokens` instead of `unsplittable_tokens` because there is really no splitting before tokenization in Sentencepiece. I also named it `special_tokens` for WordPiece (this PR), though it could be named `unsplittable_tokens` since there is splitting before tokenization. For the BytePair tokenizer it must be `unsplittable_tokens`, because `special_tokens` would conflict with `WhisperTokenizer`, which already uses the `special_tokens` keyword.
The point is, can the argument be named:
- `special_tokens` for Sentencepiece
- `special_tokens` or `unsplittable_tokens` for WordPiece
- `unsplittable_tokens` for BytePair

Or should we just name it `unsplittable_tokens` for Sentencepiece, even if it's not precise naming?
> All tokenizers have a `special_tokens_in_strings=False` argument. If `True`, input strings can contain special tokens, which will be handled correctly.
Regarding the `special_tokens_in_strings` argument that you suggested: I think it has little value as a construction argument, because we are already passing a `special_tokens` list, and if the user passes a `special_tokens` list that means there are special tokens in the strings that need to be handled.
But if you mean a call argument, then it would be useful: it would give the user the ability to tokenize inputs without caring about special tokens, even if a `special_tokens` list was passed during construction.
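The two options side by side (a sketch; neither signature is final):

```python
from keras_nlp.tokenizers import WordPieceTokenizer

vocab = ["[UNK]", "[CLS]", "hi"]

# Option A: construction argument (arguably redundant when a
# `special_tokens` list is already passed):
tokenizer = WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]"],
    special_tokens_in_strings=True,
)

# Option B (hypothetical): call argument, so one tokenizer can do both:
tokenizer = WordPieceTokenizer(vocabulary=vocab, special_tokens=["[CLS]"])
ids_plain = tokenizer.tokenize("[CLS] hi", special_tokens_in_strings=False)
ids_special = tokenizer.tokenize("[CLS] hi", special_tokens_in_strings=True)
```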
Does that make sense?
@mattdangerw Indeed! Thanks for that; it sounds rational, and it's a point of view I was not aware of. I will address that for WordPieceTokenizer and BytePairTokenizer.
@mattdangerw Check out the last changes. I have added a `special_tokens_in_strings` argument for the base tokenizer and the model tokenizers. If all is good, I will go ahead and do the same for BytePairTokenizer.
The default behaviour of BytePairTokenizer will change: it currently handles special tokens correctly by default, but I will change it so that special tokens are only tokenized correctly when `special_tokens_in_strings` is `True`.
Thanks! Will take a look shortly!
> The default behaviour of BytePairTokenizer will change: it currently handles special tokens correctly by default, but I will change it so that special tokens are only tokenized correctly when `special_tokens_in_strings` is `True`.
Yeah, this is OK I think. We would just want to call it out clearly as a breaking change in our next release. Thankfully the fix is quite clear: anyone who wants the old behavior can pass the option.
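So after the change, anyone relying on the old default opts back in explicitly (a sketch, assuming the argument names discussed above):

```python
import keras_nlp

# Toy byte-pair vocabulary and merges, just for illustration.
vocab = {"<|endoftext|>": 0, "a": 1, "b": 2, "ab": 3}
merges = ["a b"]

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary=vocab,
    merges=merges,
    unsplittable_tokens=["<|endoftext|>"],  # naming per this thread
    special_tokens_in_strings=True,         # restores the old default
)
```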