Add `XLNetTokenizer`
Adds the XLNet tokenizer (`XLNetTokenizer`) to the library. This is part 2 of adding XLNet to KerasNLP.
This PR is ready for review.
cc: @mattdangerw
Hi @mattdangerw, I have made some changes in this commit:
- All string-related operations are now replaced with their `tensorflow-text` and `tf.strings` equivalents.
- Added a new `postprocess` method which takes care of all the complex logic. Since we must keep those inner loops, as I mentioned here, I used `tf.py_function` to wrap the Python function (a sketch of this pattern follows the list). Now the tokenizer works with `ds.map(tokenizer)`!
- Added a test, `test_tokenize_ds`, which checks whether the tokenizer works with `tf.data.Dataset`.
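For reference, here is a minimal sketch of the wrapping pattern described above. The `_postprocess_py` helper and its comma-splitting loop are illustrative stand-ins for this PR's actual logic, not the real implementation:

```python
import tensorflow as tf

def _postprocess_py(token_strings):
    # Plain-Python inner loop (runs eagerly inside tf.py_function):
    # split a trailing comma off any piece ending in "<digit>,",
    # e.g. ["▁12", "3,"] -> ["▁12", "3", ","].
    out = []
    for piece in token_strings.numpy():
        piece = piece.decode("utf-8")
        if len(piece) > 1 and piece.endswith(",") and piece[-2].isdigit():
            out.extend([piece[:-1], ","])
        else:
            out.append(piece)
    return tf.constant(out, dtype=tf.string)

def postprocess(tokens):
    # tf.py_function lets the Python loop run inside a tf.data pipeline,
    # so ds.map(tokenizer) keeps working.
    return tf.py_function(_postprocess_py, inp=[tokens], Tout=tf.string)
```

Note that each call still drops back into the Python interpreter, which is the trade-off discussed below.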
I believe this way we can keep the workaround for digits followed by a comma while still allowing the tokenizer to be used with `tf.data.Dataset`.
Please review it and let me know whether the code style complies with the library's conventions.
Will need to step through this more carefully soon, but at first blush this looks like it would probably be quite inefficient. `py_function` is quite possibly going to be slower than doing this all in Python, because of the frequent conversion of types and the dipping in and out of the Python interpreter. There is also a lot of going from string -> int -> string.
We should probably think about this a little more.
What happens if you just preprocess any `123,` -> `123 ,`?
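As a sketch of that suggestion (the regex below is my assumption of what such a preprocessing rule could look like, not code from this PR), a single `tf.strings.regex_replace` could insert the space entirely in-graph, avoiding `tf.py_function`:

```python
import tensorflow as tf

def split_digit_comma(text):
    # Insert a space between a digit and a following comma,
    # e.g. "123," -> "123 ,", before SentencePiece tokenization.
    return tf.strings.regex_replace(text, r"(\d),", r"\1 ,")

print(split_digit_comma(tf.constant("I have 123, apples")))
# tf.Tensor(b'I have 123 , apples', shape=(), dtype=string)
```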
If we don't use the extra logic and only use `tokenizer._sentence_piece.tokenize`, we get `['▁12', '3,']`.
The correct output is `['▁12', '3', ',']`.
@susnato, is this PR still relevant? If yes, could you please update the status of this PR. Thanks!