Issue with `BytePairTokenizer`

Open abuelnasr0 opened this issue 1 year ago • 0 comments

Describe the bug

Currently, `BytePairTokenizer` supports tokenization of special tokens. However, when two special tokens contain the same characters, such as `<s>` and `</s>`, the tokenizer maps both of them to the id of whichever token appears first in the `unsplittable_tokens` list passed during initialization. This is caused by the way `BytePairTokenizer` handles special tokens: before splitting on special characters, it creates an alt for each special token so that the token itself is not split, but it creates the same alt, `Ĵs`, for both `<s>` and `</s>`.
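The collision can be illustrated with a minimal sketch. This is not the library's code: `make_alt` is a hypothetical helper, and the rule of dropping every non-alphanumeric character is an assumption inferred from the `Ĵs` alt described above. Only the prefix `Ĵ` and the two tokens come from this report.

```python
import re

ALT_PREFIX = "Ĵ"  # prefix seen in the alt reported in this issue


def make_alt(token: str) -> str:
    """Hypothetical helper: build an alt by stripping splitter characters.

    Assumption: every non-alphanumeric character is removed before the
    prefix is added. The real tokenizer may use a different pattern.
    """
    return ALT_PREFIX + re.sub(r"[^A-Za-z0-9]+", "", token)


unsplittable_tokens = ["<s>", "</s>"]
alts = [make_alt(token) for token in unsplittable_tokens]

# Map each alt back to a token. When two tokens share an alt, only the
# first one in the list can be recovered, which mirrors the reported bug.
alt_to_token = {}
for token, alt in zip(unsplittable_tokens, alts):
    alt_to_token.setdefault(alt, token)

print(alts)
print(alt_to_token)
```

Because both tokens collapse to one alt, any lookup from the alt back to an id can only return a single token, so `</s>` ends up with the id of `<s>`.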

To Reproduce

Here is a Gist: https://colab.research.google.com/gist/abuelnasr0/3112279a3e0108cad1862a70645a406f/bytebairtokenizer_bug.ipynb

Expected behavior

Each special token should be tokenized to its own id.

Additional context

I can open a PR to fix this bug. This is also related to keras-team/keras-nlp#1397.

abuelnasr0, Feb 13 '24 21:02