
WordPieceTokenizer token splitting

chenmoneygithub opened this issue 2 years ago • 4 comments

This is not necessarily a bug, but I find it confusing.

I tried to tokenize a sequence like "[start] have a nice day", but it appears that with the default setup, "[start]" is split into 3 tokens even when the vocab contains "[start]". I tried to play around with keep_pattern (for example, keep_pattern=r"\[\]" and some other variants) but had no luck. So my question is: how do I stop certain characters from being treated as splitters? Also, maybe we can improve the docstring to clearly show the approach?

chenmoneygithub avatar May 14 '22 23:05 chenmoneygithub

keep_pattern is the regex of split matches to keep as standalone tokens (i.e., which of the substrings you split on should be kept in the output rather than discarded).

split_pattern is the regex of characters to split on.

Sounds like you would like to split differently (e.g., not split on brackets), so you would need to change the split_pattern argument.
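For example, a minimal sketch of that (the split_pattern argument name follows this thread and may differ between keras-nlp versions; vocab is assumed to be your own vocabulary list containing "[start]"):

```python
import keras_nlp

# Sketch only: split on whitespace alone, so bracketed tokens like
# "[start]" are never broken apart. Note the tradeoff: punctuation in
# regular text will now stay attached to the neighboring word.
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    split_pattern=r"\s",
)
tokenizer("[start] have a nice day")
```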

mattdangerw avatar May 16 '22 20:05 mattdangerw

But given the overall use case you are describing, it would probably be easier to just add a start token after tokenization. Then there are no caveats about messing up translation when the real source text contains brackets or the literal string [start].
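A minimal sketch of that approach, where tokenizer and start_id are placeholders for your own tokenizer and the vocabulary index of your start token:

```python
import tensorflow as tf

# Tokenize the raw text first, then prepend the start token id; the
# input string never needs to contain the literal "[start]".
token_ids = tokenizer("have a nice day")
token_ids = tf.concat([[start_id], token_ids], axis=0)
```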

mattdangerw avatar May 16 '22 20:05 mattdangerw

Thanks! I am not very familiar with regex. Is it possible to set the split pattern to (the default split pattern minus "[" and "]")?

chenmoneygithub avatar May 16 '22 21:05 chenmoneygithub

Hmm, I'm not sure we would want to support people doing math on our default regex pattern; that would be a compat nightmare. Something like this would work:

```python
split_pattern=r"\s|[!-/:-@^-`{-~]",
keep_pattern=r"[!-/:-@^-`{-~]",
```
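To sanity-check those patterns, here is a quick demo using tf.text's regex_split op, which is what the tokenizer's splitting is built on; the punctuation class deliberately skips "[", "\", and "]":

```python
import tensorflow_text as tf_text

# The ^-` range starts at 0x5E, so 0x5B-0x5D ("[", "\", "]") are
# excluded from the delimiter class and "[start]" survives intact.
tf_text.regex_split(
    ["[start] have a nice day!"],
    delim_regex_pattern=r"\s|[!-/:-@^-`{-~]",
    keep_delim_regex_pattern=r"[!-/:-@^-`{-~]",
)
# -> <tf.RaggedTensor [[b'[start]', b'have', b'a', b'nice', b'day', b'!']]>
```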

But overall I think we should discourage something like this as the way to add start and end tokens. We should probably make sure our guides and examples show adding a start and end token after tokenization, which is much less hacky.

As future work, we could try some way to specify a set of special tokens, so that "The cat walked down the [MASK]. [PAD] [PAD]" would tokenize as you would expect. But probably not everyone would want this, and I don't believe we have the op-level support for something like that yet; it would be something we need to collaborate with tf.text on.
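In the meantime, a hypothetical user-side workaround is to split the special tokens out before tokenizing. Everything below is a sketch of an assumption about your setup, not current keras-nlp API; in particular it assumes the tokenizer exposes a token_to_id lookup:

```python
import re
import tensorflow as tf

def tokenize_with_specials(text, tokenizer, specials=("[MASK]", "[PAD]")):
    # Hypothetical sketch: split the string on the special tokens first,
    # tokenize the plain spans, then splice the special token ids back in.
    pattern = "(" + "|".join(re.escape(s) for s in specials) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in specials:
            ids.append(tokenizer.token_to_id(piece))  # assumes this lookup exists
        elif piece.strip():
            ids.extend(tokenizer(piece.strip()).numpy().tolist())
    return tf.constant(ids)

tokenize_with_specials("The cat walked down the [MASK]. [PAD] [PAD]", tokenizer)
```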

mattdangerw avatar May 16 '22 23:05 mattdangerw