keras-nlp
Issue with `BytePairTokenizer`
Describe the bug
Currently `BytePairTokenizer` supports tokenization of special tokens, but when two special tokens contain the same characters, such as `<s>` and `</s>`, the tokenizer maps both of them to the id of whichever token appears first in the `unsplittable_tokens` list passed during initialization.
This happens because of the way `BytePairTokenizer` handles special tokens. Before splitting on special characters, it creates an alt for each unsplittable token so that the token itself is not split apart. However, the alt is derived only from the token's alphanumeric characters, so `<s>` and `</s>` both produce the same alt, `Ĵs`.
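A minimal sketch of the alt-creation logic described above (this is a simplified, assumed reimplementation for illustration, not the exact library code): stripping the non-word characters from each unsplittable token and prepending a prefix makes `<s>` and `</s>` collide.

```python
import re

def create_alts_for_unsplittable_tokens(unsplittable_tokens):
    # Assumed simplified logic: drop all non-word characters from the
    # token and prepend a prefix character, "Ĵ".
    prefix = "Ĵ"
    return [prefix + re.sub(r"\W", "", token) for token in unsplittable_tokens]

# Both special tokens collapse to the same alt, so when alts are mapped
# back to token ids, both resolve to the first token in the list.
print(create_alts_for_unsplittable_tokens(["<s>", "</s>"]))  # ['Ĵs', 'Ĵs']
```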
To Reproduce
Here is a Gist: https://colab.research.google.com/gist/abuelnasr0/3112279a3e0108cad1862a70645a406f/bytebairtokenizer_bug.ipynb
Expected behavior
Each special token should be tokenized to its own id.
Additional context
I can open a PR to fix this bug. This is also related to keras-team/keras-nlp#1397.