keras-nlp Add `oov_token` Argument to `BytePairTokenizer`

Open abuelnasr0 opened this issue 1 year ago • 1 comments

The `<unk>` token is not really used by the `BytePairTokenizer`. Instead, out-of-vocabulary (OOV) tokens are mapped to -1, which causes an index error in the embedding layer. This only happens when the vocabulary is limited and does not contain all the bytes, for example when trying an example with a small custom vocabulary rather than a preset. Even so, adding an `oov_token` argument would be an improvement.
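To illustrate the requested behavior, here is a minimal pure-Python sketch. It is not the keras-nlp implementation: the toy `vocab`, the `lookup` helper, and its `oov_token` parameter are all hypothetical names used only to show the difference between mapping unknown tokens to -1 and mapping them to the id of an OOV token.

```python
# Hypothetical toy vocabulary that is too small to cover every input token.
# "<unk>" is the assumed OOV token; none of this is keras-nlp code.
vocab = {"<unk>": 0, "a": 1, "b": 2, "ab": 3}


def lookup(tokens, vocab, oov_token=None):
    """Map tokens to ids.

    Tokens missing from `vocab` become -1 when `oov_token` is None,
    otherwise they become the id of `oov_token`.
    """
    if oov_token is not None and oov_token not in vocab:
        raise ValueError(f"oov_token {oov_token!r} must be in the vocabulary")
    default = -1 if oov_token is None else vocab[oov_token]
    return [vocab.get(token, default) for token in tokens]


tokens = ["ab", "c", "a"]  # "c" is not in the vocabulary

# Behavior described in the issue: the unknown token maps to -1,
# which is not a valid row index for an embedding table of size len(vocab).
ids_without_oov = lookup(tokens, vocab)

# Requested behavior: the unknown token maps to the id of "<unk>",
# so every id stays within [0, len(vocab)).
ids_with_oov = lookup(tokens, vocab, oov_token="<unk>")

print(ids_without_oov)
print(ids_with_oov)
```

With an OOV token set, every id falls inside the valid range for an embedding table sized to the vocabulary, so the lookup no longer fails on small custom vocabularies.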

Feb 24 '24 23:02 abuelnasr0
