Add `XLNetTokenizer`
Adds the XLNet tokenizer (`XLNetTokenizer`) to the library. This is part 2 of adding XLNet to KerasNLP.
This PR is ready for review.
cc: @mattdangerw
Hi @mattdangerw, I have made some changes in this commit:
- All string-related operations are now replaced with their `tensorflow-text` and `tf.strings` equivalents.
- Added a new `postprocess` method which takes care of all the complex logic. Since we must keep those inner loops, as I mentioned here, I used `tf.py_function` to wrap the Python function (a sketch of this pattern follows the list). Now the tokenizer works with `ds.map(tokenizer)`!
- Added a test, `test_tokenize_ds`, which checks whether the tokenizer works with `tf.data.Dataset`.
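For reference, here is a minimal sketch of the wrapping pattern described above. The `_postprocess_py` helper and its comma-splitting loop are illustrative stand-ins for this PR's actual logic, not the real implementation:

```python
import tensorflow as tf

def _postprocess_py(token_strings):
    # Plain-Python inner loop (runs eagerly inside tf.py_function):
    # split a trailing comma off any piece ending in "<digit>,",
    # e.g. ["▁12", "3,"] -> ["▁12", "3", ","].
    out = []
    for piece in token_strings.numpy():
        piece = piece.decode("utf-8")
        if len(piece) > 1 and piece.endswith(",") and piece[-2].isdigit():
            out.extend([piece[:-1], ","])
        else:
            out.append(piece)
    return tf.constant(out, dtype=tf.string)

def postprocess(tokens):
    # tf.py_function lets the Python loop run inside a tf.data pipeline,
    # so ds.map(tokenizer) keeps working.
    return tf.py_function(_postprocess_py, inp=[tokens], Tout=tf.string)
```

Note that each call still drops back into the Python interpreter, which is the trade-off discussed below.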
I believe this way we can keep the workaround for digits followed by a comma while still allowing the tokenizer to be used with `tf.data.Dataset`.
Please review it and let me know whether the code style complies with the library's conventions.
Will need to step through this more carefully soon, but at first blush this looks like it would probably be quite inefficient. `py_function` is quite possibly going to be slower than doing this all in Python, because of the frequent conversion of types and the dipping in and out of the Python interpreter. There is also a lot of going from string -> int -> string.
We should probably think about this a little more.
What happens if you just preprocess any `123,` -> `123 ,`?
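As a sketch of that suggestion (the regex below is my assumption of what such a preprocessing rule could look like, not code from this PR), a single `tf.strings.regex_replace` could insert the space entirely in-graph, avoiding `tf.py_function`:

```python
import tensorflow as tf

def split_digit_comma(text):
    # Insert a space between a digit and a following comma,
    # e.g. "123," -> "123 ,", before SentencePiece tokenization.
    return tf.strings.regex_replace(text, r"(\d),", r"\1 ,")

print(split_digit_comma(tf.constant("I have 123, apples")))
# tf.Tensor(b'I have 123 , apples', shape=(), dtype=string)
```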
If we don't use the extra logic and only use `tokenizer._sentence_piece.tokenize`, we get `['▁12', '3,']`.
The correct output is `['▁12', '3', ',']`.
@susnato, is this PR still relevant? If yes, could you please update the status of this PR. Thanks!