keras-nlp
WordPieceTokenizer token splitting
This is not necessarily a bug, but I find it confusing.
I tried to tokenize a sequence like "[start] have a nice day", but it appears that with the default setup "[start]" is split into 3 tokens even though the vocab contains "[start]". I tried to play around with keep_pattern (e.g. keep_pattern=r"\[\]" and some other approaches) but had no luck. So my question is: how do I stop certain characters from being treated as splitters? Also, maybe the docstring could be improved to clearly show the approach?
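Roughly what I am doing, with a toy vocabulary just for illustration (I am reading tokens back out of the vocab list directly, which I believe matches how ids get assigned):

```python
import keras_nlp

# Toy vocab that explicitly contains "[start]".
vocab = ["[UNK]", "[start]", "[", "]", "start", "have", "a", "nice", "day"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)

ids = tokenizer("[start] have a nice day")
print([vocab[int(i)] for i in ids])
# I see something like:
#   ['[', 'start', ']', 'have', 'a', 'nice', 'day']
# i.e. "[start]" becomes 3 tokens even though it is in the vocab.
```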
split_pattern is the regex describing what to split on
keep_pattern is the regex describing which of those split characters are kept as tokens (rather than discarded, like whitespace)
Sounds like you would like to split differently (e.g. not split on brackets), so you would need to change the split_pattern argument.
But given the overall use case you are describing, it would probably be easier to just add a start token after tokenization. Then there are no caveats about messing up translation when the real source text contains brackets, or the literal string "[start]".
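For example, something along these lines (a rough sketch; I am looking the id up from the vocab list, your version may also have a token_to_id helper):

```python
import tensorflow as tf
import keras_nlp

vocab = ["[UNK]", "[start]", "have", "a", "nice", "day"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)

# Tokenize the raw text only, then prepend the id of the special token.
ids = tokenizer("have a nice day")
start_id = vocab.index("[start]")
ids = tf.concat([tf.constant([start_id], dtype=ids.dtype), ids], axis=0)
```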
Thanks! I am not very familiar with regex. Is it possible to set the split pattern to (the default split pattern minus "[" and "]")?
Hmm, I'm not sure we would want to support people doing math on our default regex pattern; that would be a compat nightmare. Something like this would work:
split_pattern=r"\s|[!-/:-@^-`{-~]",
keep_pattern=r"[!-/:-@^-`{-~]",
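Concretely, that would plug into the constructor roughly like this (a sketch assuming the split_pattern / keep_pattern arguments as they exist in the current signature; the character classes are just the default punctuation ranges with "[", "\", and "]" dropped):

```python
import keras_nlp

vocab = ["[UNK]", "[start]", "have", "a", "nice", "day"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    split_pattern=r"\s|[!-/:-@^-`{-~]",
    keep_pattern=r"[!-/:-@^-`{-~]",
)

# "[start]" is no longer split apart, so it matches the vocab entry directly.
print(tokenizer("[start] have a nice day"))
```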
But overall I think we should discourage something like this as a way to add start and end tokens. We should probably try to make sure our guides and examples show adding a start and end token after tokenization, which would be much less hacky.
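For example, something like this with keras_nlp.layers.StartEndPacker, if your version ships it (otherwise the same thing can be done with a plain concat, as sketched above):

```python
import keras_nlp

vocab = ["[PAD]", "[UNK]", "[start]", "[end]", "have", "a", "nice", "day"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)

# Add special tokens and pad to a fixed length after tokenization, so the
# tokenizer never has to see the "[start]" / "[end]" string literals.
packer = keras_nlp.layers.StartEndPacker(
    sequence_length=10,
    start_value=vocab.index("[start]"),
    end_value=vocab.index("[end]"),
    pad_value=vocab.index("[PAD]"),
)

ids = packer(tokenizer("have a nice day"))
# -> ids for: [start] have a nice day [end] [PAD] [PAD] [PAD] [PAD]
```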
As future work we could try some way to specify a set of special tokens, so that "The cat walked down the [MASK]. [PAD] [PAD]" would tokenize as you would expect. But probably not everyone would want this, and I don't believe we have the op-level support for something like that yet; it would be something we would need to collaborate with TF Text on.