BytePair Tokenizer Implementation
An implementation of OpenAI's byte-pair encoder in TF-compatible graph mode, which would allow for end-to-end development of certain pretrained models that use this tokenizer (RoBERTa, GPT, etc.).
Currently a rough version, and suggestions are welcome!
Forgot to leave a comment here, but this is from conversations with @jessechancy.
For the cache, we want a way to go from a string word input to a token list output. We are currently doing that with two different hash tables, going string -> int and int -> string, to work around the fact that tf does not offer a string -> string lookup.
I think we could have a slightly simpler workaround: hash the input string, then do an int -> string lookup to get the tokenized form. That would save us one of these hash tables, which should make things simpler, faster, and lighter on memory.
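To make that concrete, here is a rough sketch of what a single-table cache could look like. The helper names, the bucket count, and the choice of `tf.strings.to_hash_bucket_fast` plus a `MutableHashTable` are assumptions for illustration, not what this PR currently does:

```python
import tensorflow as tf

# Sketch of a word -> tokenization cache backed by a single int -> string table.
# Keys are hashes of the word, so no string -> string lookup is needed.
cache = tf.lookup.experimental.MutableHashTable(
    key_dtype=tf.int64, value_dtype=tf.string, default_value=""
)

def _hash_words(words):
    # Hash string words into int64 keys; the bucket count is an arbitrary choice.
    return tf.strings.to_hash_bucket_fast(words, 2**40)

def cache_insert(words, tokenized):
    # `tokenized` holds the space-joined token pieces for each word.
    cache.insert(_hash_words(words), tokenized)

def cache_lookup(words):
    # Returns "" (the default value) for words not tokenized yet.
    return cache.lookup(_hash_words(words))

cache_insert(tf.constant(["hello"]), tf.constant(["hell o"]))
print(cache_lookup(tf.constant(["hello", "world"])))
# tf.Tensor([b'hell o' b''], shape=(2,), dtype=string)
```

The trade-off is that two distinct words could hash to the same key; with a large bucket count that should be rare, but it is not impossible.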
Another issue we need to work through is that Python regex and tf regex appear to handle certain whitespace characters differently (non-breaking spaces, for example). We need to fix this, probably with some regex hacking.
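Something like the following is the kind of regex hack meant here, i.e. normalizing Unicode space separators before the pre-split so the Python regex and tf regex see the same input. This is just a sketch of one option, not necessarily the right final fix:

```python
import tensorflow as tf

def normalize_whitespace(text):
    # Map Unicode space separators (category Zs, which includes the
    # non-breaking space U+00A0) to a plain ASCII space before pre-splitting.
    # RE2, which backs tf.strings.regex_replace, understands \p{Zs}.
    return tf.strings.regex_replace(text, r"\p{Zs}", " ")

print(normalize_whitespace(tf.constant("a\u00a0b")))  # tf.Tensor(b'a b', ...)
```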
@mattdangerw Do we still have unresolved issues with the functionality of this implementation? I played around with it a bit more and the functionality looks correct to me (compared with RoBERTa's tokenizer). I remember you mentioned there are some special tokens handled differently?
One concern I have is that the implementation is quite complex, but after going through the code I don't know where we can really simplify it. Compared to the fairseq implementation, the additional complexity mainly comes from the tensor manipulation when we do merge/check/etc., and from how hashing and the while-loop are handled (see the sketch after this list). We have two options here:
- Just use the Python implementation and disregard graph support, which seems to go against our design pattern.
- Use Jesse's implementation, which fits our design pattern, but the code is quite heavy.
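For context on where that complexity comes from, here is a heavily simplified sketch of the kind of graph-mode merge loop involved. The toy merge table and helper names are invented for illustration; this is not the code in the PR:

```python
import tensorflow as tf

# Toy merge table mapping "left right" pairs to ranks (lower rank merges first).
# The pairs, ranks, and sentinel value are made up purely for illustration.
NO_MERGE = tf.constant(1 << 30, dtype=tf.int64)
merge_table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(["h e", "l l", "he ll", "hell o"]),
        values=tf.constant([0, 1, 2, 3], dtype=tf.int64),
    ),
    default_value=NO_MERGE,
)

@tf.function
def bpe_merge_word(tokens):
    """Repeatedly merge the best-ranked adjacent pair within a single word."""

    def pair_ranks(toks):
        # Rank of every adjacent pair, padded with NO_MERGE so the reduction
        # below never runs over an empty tensor once only one token is left.
        joined = tf.strings.join([toks[:-1], toks[1:]], separator=" ")
        return tf.concat([merge_table.lookup(joined), NO_MERGE[tf.newaxis]], axis=0)

    def cond(toks):
        return tf.reduce_min(pair_ranks(toks)) < NO_MERGE

    def body(toks):
        i = tf.argmin(pair_ranks(toks), output_type=tf.int32)
        merged = tf.strings.join([toks[i], toks[i + 1]])
        return [tf.concat([toks[:i], merged[tf.newaxis], toks[i + 2:]], axis=0)]

    # The token list shrinks each iteration, hence the relaxed shape invariant.
    return tf.while_loop(
        cond, body, [tokens], shape_invariants=[tf.TensorShape([None])]
    )[0]

print(bpe_merge_word(tf.constant(["h", "e", "l", "l", "o"])))
# tf.Tensor([b'hello'], shape=(1,), dtype=string)
```

The real implementation additionally has to run this over ragged batches of words and consult the cache, which is presumably where most of the heaviness comes from.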
@chenmoneygithub left a few brief comments above in that regard.
The issue with the different output is apparently the non-breaking space characters.
And there are some things I would like to try re: simplifying, in particular removing one of the hash tables by using a hash function (https://github.com/keras-team/keras-nlp/pull/303#issuecomment-1234824157).
Overall I think we should probably go with this and not the python implementation. IIUC the py_function escape hatch to tf.data will be prohibitively slow. But we do need some digging and cleanup on this PR before we can land it.
@mattdangerw Simply wrapping with py_function hits tons of runtime errors. The alternative to this PR is not supporting a tf.data pipeline (as with the HuggingFace RoBERTa TF model), which actually wouldn't cause a performance loss based on what Jesse told me earlier, but the downside is that it breaks our contract of supporting tf.data. My thought is the same though: we should check in this implementation. Just clarifying here.
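For reference, the py_function escape hatch under discussion would look roughly like the sketch below (the `py_bpe_encode` body is just a placeholder, not a real encoder). Every element drops out of the graph and back into the Python interpreter, which is where the runtime errors and the performance concerns come from:

```python
import tensorflow as tf

def py_bpe_encode(text):
    # Placeholder for a pure-Python BPE encoder (e.g. OpenAI's original code).
    # Here we just return the raw byte values so the example is runnable.
    return list(text.numpy())

def tf_bpe_encode(text):
    # Escape hatch: run arbitrary Python inside the graph via tf.py_function.
    tokens = tf.py_function(py_bpe_encode, [text], Tout=tf.int32)
    tokens.set_shape([None])
    return tokens

ds = tf.data.Dataset.from_tensor_slices(["hello world", "byte pair"])
ds = ds.map(tf_bpe_encode)
for tokens in ds:
    print(tokens)
```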
Do you remember what output diff you saw earlier? Jesse found one minor diff before his presentation, but he said he had fixed it.
@mattdangerw, @chenmoneygithub - a minor comment here. The merges.txt file in the official repo (the HF repo has the same file as the official one) has `#version: ...` on the first line: https://huggingface.co/gpt2/blob/main/merges.txt. I have copied the same merges.txt file over to the GCP folder.
So, we should probably ignore the first line of the text file after https://github.com/keras-team/keras-nlp/pull/303/files#diff-9f7f9b8a01fa1e5d050c27b1dcfdb801ec24d46f724a2eef011c8f86c9bed53aR142, right?
@abheesht17 Thanks for raising it! It's actually fine, because `#version: 0.2` will never be applied as a merge rule: applying it would require seeing a `#version:` token next to a `0.2` token, but any such text gets split apart in the pre-split phase.
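A quick way to see that: the pre-split pattern from OpenAI's original GPT-2 encoder never emits `#version:` as a single token, so that first line can never fire as a merge rule. For example, with the Python `regex` package:

```python
import regex as re  # the `regex` package, which supports \p{...} classes

# Pre-split pattern from OpenAI's original GPT-2 encoder.
pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(pat.findall("#version: 0.2"))
# ['#', 'version', ':', ' 0', '.', '2']
```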
Closing; this was landed with some edits in https://github.com/keras-team/keras-nlp/pull/389 by @chenmoneygithub. But huge props for writing this, this is a big deal for the library!!