Byte-Pair Encoding tokenizer
Description
I would like to use a model from Hugging Face that uses a BPE tokenizer: flaubert/flaubert_base_uncased.
It can be found at: https://huggingface.co/flaubert/flaubert_base_uncased
It doesn't ship a tokenizer.json, so the DJL Hugging Face tokenizers extension won't work with it.
- Is there any other way to build this kind of tokenizer with djl?
- Do you think it would be possible to add a method to the Hugging Face wrapper that builds the tokenizer from the vocabulary.json and the merges.txt?
Would this change the current API? How?
Maybe it could just be a new method on HuggingFaceTokenizer that creates a tokenizer from the two files?
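To make the request concrete, here is a minimal, self-contained sketch of the model those two files describe (illustration only; `bpe_word` and the toy merge table are hypothetical, not the DJL or `tokenizers` API). merges.txt is an ordered list of symbol pairs; a BPE tokenizer splits a word into characters and repeatedly applies the earliest-listed (lowest-rank) merge until none applies:

```rust
use std::collections::HashMap;

/// Apply BPE merges to a single word, given a ranked merge table
/// (pair -> rank, lower rank = merged first), as merges.txt defines.
fn bpe_word(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from individual characters.
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the best (lowest) merge rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..symbols.len().saturating_sub(1) {
            let pair = (symbols[i].clone(), symbols[i + 1].clone());
            if let Some(&rank) = ranks.get(&pair) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Merge the pair into a single symbol.
                let merged = symbols[i].clone() + &symbols[i + 1];
                symbols[i] = merged;
                symbols.remove(i + 1);
            }
            None => break, // no applicable merge left
        }
    }
    symbols
}

fn main() {
    // Toy merge table mimicking two lines of a merges.txt file:
    //   l o
    //   lo w
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);
    println!("{:?}", bpe_word("lower", &ranks)); // ["low", "e", "r"]
}
```

The resulting symbols would then be mapped to ids through vocabulary.json, which is exactly the pair of inputs the proposed method would take.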
Who will benefit from this enhancement?
Anyone using a byte-pair encoding tokenizer.
References
- https://huggingface.co/docs/transformers/main/en/tokenizer_summary#byte-pair-encoding
- https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/bpe/model.rs
@larochef Yes, we should be able to expose a BPE API in our Hugging Face tokenizer extension.
In the meantime, you should be able to use our SentencePiece extension, which has a BPE implementation.
I've had a look at SentencePiece, but it seems I won't be able to reuse it as-is, since the training data appear to differ somewhat.
I've also had a look at the Rust binding, and it feels like it shouldn't be too hard to add this support. If it's OK, I'll gladly propose something for it, mirroring the way they do it with a builder.
I have pushed an MR for it; a few points:
- I didn't manage to get the builder to work: I ran into memory move issues that I didn't know how to fix properly. I suspect it is linked to what is done in the cast_handle method, which doesn't play nicely with the builder.
- For some reason, I can't get the JNI linking to work in unit tests; there must be something I'm missing, but I don't see what.
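For the first point, here is a hypothetical sketch (not the actual DJL JNI code; `BpeBuilder`, `cast_handle`, and `builder_set_dropout` are stand-in names) of why a consuming builder clashes with a cast_handle-style helper. JNI stores native objects as raw-pointer handles, and a helper that hands back a `&mut T` cannot be used with builder methods that take `self` by value, because you cannot move out of a mutable reference. One workaround is to reclaim ownership with `Box::from_raw`, apply the consuming method, and return a fresh handle:

```rust
struct BpeBuilder {
    dropout: Option<f32>,
}

impl BpeBuilder {
    fn new() -> Self {
        BpeBuilder { dropout: None }
    }
    // Consuming builder method, in the style of the tokenizers crate.
    fn dropout(mut self, p: f32) -> Self {
        self.dropout = Some(p);
        self
    }
}

// A cast_handle-style helper: turns a handle back into a &mut reference.
// Calling `builder.dropout(0.1)` through this reference does not compile
// ("cannot move out of `*builder`"), which matches the move errors above.
#[allow(dead_code)]
unsafe fn cast_handle<T>(handle: i64) -> &'static mut T {
    &mut *(handle as *mut T)
}

// Workaround: take ownership back from the raw pointer, apply the
// consuming builder method, then leak the updated builder as a new handle.
unsafe fn builder_set_dropout(handle: i64, p: f32) -> i64 {
    let builder = Box::from_raw(handle as *mut BpeBuilder);
    let builder = builder.dropout(p);
    Box::into_raw(Box::new(builder)) as i64
}

fn main() {
    // Simulate the JNI lifecycle: create a handle, mutate through it,
    // then reclaim it (in real code the Java side frees it eventually).
    let handle = Box::into_raw(Box::new(BpeBuilder::new())) as i64;
    let handle = unsafe { builder_set_dropout(handle, 0.1) };
    let builder = unsafe { Box::from_raw(handle as *mut BpeBuilder) };
    assert_eq!(builder.dropout, Some(0.1));
}
```

The cost is that every consuming builder call invalidates the old handle and returns a new one, so the Java wrapper has to update its stored pointer after each call; an alternative is to keep the builder's fields in a plain struct and only construct the real builder once, at build time.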
Tell me what you think of it, and let's see how we can get it fully working!