keras-nlp

Add `oov_token` Argument to `BytePairTokenizer`

Open abuelnasr0 opened this issue 1 year ago • 1 comments

The `<unk>` token is not really used by `BytePairTokenizer`. Instead, OOV tokens are mapped to -1, which causes an index error in the embedding layer. This only occurs when the vocabulary is limited (it doesn't contain all the bytes), for example when trying a custom small vocabulary rather than a preset, but adding this feature would still be an improvement.
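To illustrate the problem described above, here is a minimal sketch in plain Python. It does not use keras-nlp: `token_to_id`, `embed`, the toy `vocab`, and the `oov_token` parameter are hypothetical stand-ins, and the real tokenizer and embedding layer may behave differently in detail. It only shows how a -1 fallback ends up outside the embedding table, and how falling back to the id of an OOV token would avoid that.

```python
# Hypothetical, simplified stand-ins; this is not the keras-nlp implementation.
vocab = {"<unk>": 0, "a": 1, "b": 2}  # deliberately tiny vocabulary


def token_to_id(token, vocab, oov_token=None):
    """Map a token to its id; fall back to the oov_token's id if given, else -1."""
    default = vocab[oov_token] if oov_token is not None else -1
    return vocab.get(token, default)


def embed(ids, table):
    """Toy embedding lookup that rejects ids outside the table."""
    for i in ids:
        if not 0 <= i < len(table):
            raise IndexError(f"id {i} is outside [0, {len(table)})")
    return [table[i] for i in ids]


table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # one row per vocabulary entry

# "z" is not in the vocabulary, so it is out-of-vocabulary (OOV).
ids_without_oov = [token_to_id(t, vocab) for t in ["a", "z"]]
ids_with_oov = [token_to_id(t, vocab, oov_token="<unk>") for t in ["a", "z"]]

vectors = embed(ids_with_oov, table)  # works: the OOV token maps to row 0
# embed(ids_without_oov, table) would raise IndexError because of the -1 id
```

With the assumed `oov_token` fallback, every id stays inside the table, so the lookup succeeds even with a vocabulary that does not cover all bytes.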

abuelnasr0, Feb 24 '24 23:02