
Support tokenization of special tokens for word_piece_tokenizer

Open abuelnasr0 opened this issue 1 year ago • 6 comments

This PR enables the user to tokenize special tokens from text input, as was suggested in keras-team/keras-nlp#1395. The behaviour is the same as BytePairTokenizer, I think. Also, the models' tokenizers that use WordPieceTokenizer now have the same behaviour as the models' tokenizers that use BytePairTokenizer.
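
For illustration, a minimal sketch of the kind of usage this enables; the special_tokens argument name and the tiny vocabulary are assumptions for this example, not necessarily the final API:

```python
import keras_nlp

# Tiny illustrative vocabulary that contains the special tokens we want
# to survive tokenization intact (placeholder, not a real model vocab).
vocab = ["[UNK]", "[CLS]", "[SEP]", "the", "quick", "brown", "fox"]

# `special_tokens` is the argument name used in this PR; treat it as a
# sketch of the intended behaviour rather than the final public API.
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],
)

# "[CLS]" and "[SEP]" appearing literally in the input are mapped to their
# single vocabulary IDs instead of being split into sub-word pieces.
print(tokenizer("[CLS] the quick brown fox [SEP]"))
```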

abuelnasr0 avatar Jan 09 '24 21:01 abuelnasr0

Thanks! This is a great feature.

I will think more on this next week. I like the overall approach of doing this on top of tf-text instead of inside the tf-text ops. But we may want to think about this more generally. Something like this as an end state...

  • All tokenizers have a special_tokens_in_strings=False argument. If True, input strings can contain special tokens, which will be handled correctly (a rough sketch follows this list).
  • For pretrained tokenizers, this is set up automatically to respect the model's special tokens.
  • For a base tokenizer, a special token list can be provided by the user.
  • We do this eventually for all subword tokenizers -- sentencepiece, byte pair, and word piece.
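
As a rough sketch of what that end state could look like (argument names, defaults, and the preset call here are part of the proposal under discussion, not an existing guarantee):

```python
import keras_nlp

# Base tokenizer: the user provides the special token list explicitly and
# opts in to literal special tokens in the input strings.
vocab = ["[UNK]", "[CLS]", "[SEP]", "the", "quick", "fox"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],  # user-supplied list
    special_tokens_in_strings=True,     # proposed opt-in flag
)

# Pretrained tokenizer: the model's own special tokens would be wired up
# automatically, so only the flag is needed.
bert_tokenizer = keras_nlp.models.BertTokenizer.from_preset(
    "bert_base_en_uncased",
    special_tokens_in_strings=True,
)
```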

Main question here for me... Can we guarantee for all tokenizer types that this will work without messing with the "black box" tf-text op? How?

@abuelnasr0 let me know what you think! Once we have an overall plan, landing this incrementally (e.g. starting with word piece, as you are doing here) sounds good!

mattdangerw avatar Jan 15 '24 05:01 mattdangerw

@mattdangerw

All tokenizers have a special_tokens_in_strings=False argument. If True, input strings can contain special tokens, which will be handled correctly.

Do you think special_tokens_in_strings would be better as a call argument to the tokenize method, or as an initialization argument?

Main question here for me... Can we guarantee for all tokenizer types that this will work without messing with the "black box" tf-text op? How?

For byte pair and word piece (with this PR), I think that is guaranteed. For sentencepiece, I am currently writing a Gist that will clarify how I will add the feature to sentencepiece; it guarantees that tf-text will not be touched. I will send the link here after I finish it, maybe tomorrow.

abuelnasr0 avatar Feb 13 '24 21:02 abuelnasr0

@mattdangerw check out these two PRs: keras-team/keras-nlp#1445 (support for sentencepiece) and keras-team/keras-nlp#1447 (which fixes a small nit in the tokenizer, but we can use the same PR to make the behaviour similar for all).

For SentencePiece, I named the argument special_tokens instead of unsplittable_tokens because there is really no splitting before tokenization in SentencePiece. I also named it special_tokens for WordPiece (this PR), but it could be named unsplittable_tokens since there is splitting before tokenization. For the BytePair tokenizer it must be unsplittable_tokens, because special_tokens would conflict with WhisperTokenizer, which already uses the special_tokens keyword. The point is: can the argument be named:

  • special_tokens for SentencePiece
  • special_tokens or unsplittable_tokens for WordPiece
  • unsplittable_tokens for BytePair

Or should we just name it unsplittable_tokens for SentencePiece as well, even if that's not precise naming?

All tokenizers have a special_tokens_in_strings=False argument. If True, input strings can contain special tokens, which will be handled correctly.

Regarding the special_tokens_in_strings argument you suggested: I think it adds little as a construction argument, because we already pass a special_tokens list, and if the user passes a special_tokens list, that means there are special tokens in the strings that need to be handled. But if you mean a call argument, then it would be useful: it gives the user the ability to tokenize inputs without caring about special tokens, even if a special_tokens list was passed during construction.
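
To make the construction-argument vs. call-argument distinction concrete, a hypothetical sketch (neither signature is settled in this thread):

```python
import keras_nlp

vocab = ["[UNK]", "[CLS]", "[SEP]", "the", "fox"]

# Construction argument: passing special_tokens already signals that the
# inputs may contain special tokens, so a separate flag adds little here.
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    special_tokens=["[CLS]", "[SEP]"],
)
ids = tokenizer.tokenize("[CLS] the fox [SEP]")

# Call argument (hypothetical): the same tokenizer could opt out per call,
# treating the text as plain strings even though special_tokens was set.
# ids_plain = tokenizer.tokenize("[CLS] the fox [SEP]",
#                                special_tokens_in_strings=False)
```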

abuelnasr0 avatar Feb 20 '24 20:02 abuelnasr0

Does that make sense?

@mattdangerw Indeed! Thanks for that, it sounds rational, and it's a point of view that I was not aware of. I will address that for WordPieceTokenizer and BytePairTokenizer.

abuelnasr0 avatar Mar 08 '24 00:03 abuelnasr0

@mattdangerw Check out the latest changes. I have added a special_tokens_in_strings argument for the base tokenizer and the model tokenizers. If all is good, I will go ahead and do the same for BytePairTokenizer. The default behaviour of BytePairTokenizer will change: it currently handles special tokens correctly by default, but I will change it so that it only tokenizes special tokens correctly when special_tokens_in_strings is True.
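
A hedged sketch of what that default change would mean for BytePairTokenizer users; the vocabulary and merges below are placeholders, and special_tokens_in_strings is the proposed opt-in, not the current behaviour:

```python
import keras_nlp

# Placeholder vocabulary and merges, only to make the example self-contained.
vocab = {"<|endoftext|>": 0, "a": 1, "quick": 2, "fox": 3}
merges = ["q u", "qu i", "qui c", "quic k"]

tokenizer = keras_nlp.tokenizers.BytePairTokenizer(
    vocabulary=vocab,
    merges=merges,
    unsplittable_tokens=["<|endoftext|>"],  # argument name used for BPE
    # special_tokens_in_strings=True,       # proposed opt-in; without it,
    #                                       # "<|endoftext|>" in the input
    #                                       # would no longer be kept intact
)
```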

abuelnasr0 avatar Mar 13 '24 22:03 abuelnasr0

Thanks! Will take a look shortly!

The default behaviour of BytePairTokenizer will change: it currently handles special tokens correctly by default, but I will change it so that it only tokenizes special tokens correctly when special_tokens_in_strings is True

Yeah, this is OK I think. We would just want to call it out clearly as a breaking change in our next release. Thankfully the fix is quite clear: anyone who wants the old behaviour can pass the option.

mattdangerw avatar Mar 13 '24 22:03 mattdangerw