Byte-Pair Encoding tokenizer
Description
I would like to use a model from Hugging Face that uses a BPE tokenizer: flaubert/flaubert_base_uncased.
It can be found at: https://huggingface.co/flaubert/flaubert_base_uncased
It doesn't ship a tokenizer.json, so the DJL Hugging Face tokenizers extension won't work with it.
- Is there any other way to build this kind of tokenizer with djl?
- Do you think it would be possible to add a method to the Hugging Face wrapper that builds the tokenizer from the vocabulary.json and the merges.txt?
Would this change the current API? How?
Maybe it could just be a new method on HuggingFaceTokenizer that creates a tokenizer from the two files?
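To make the request concrete, here is a minimal, self-contained sketch of the model those two files describe (illustration only; `bpe_word` and the toy merge table are hypothetical, not the DJL or `tokenizers` API). merges.txt is an ordered list of symbol pairs; a BPE tokenizer splits a word into characters and repeatedly applies the earliest-listed (lowest-rank) merge until none applies:

```rust
use std::collections::HashMap;

/// Apply BPE merges to a single word, given a ranked merge table
/// (pair -> rank, lower rank = merged first), as merges.txt defines.
fn bpe_word(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from individual characters.
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the best (lowest) merge rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..symbols.len().saturating_sub(1) {
            let pair = (symbols[i].clone(), symbols[i + 1].clone());
            if let Some(&rank) = ranks.get(&pair) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Merge the pair into a single symbol.
                let merged = symbols[i].clone() + &symbols[i + 1];
                symbols[i] = merged;
                symbols.remove(i + 1);
            }
            None => break, // no applicable merge left
        }
    }
    symbols
}

fn main() {
    // Toy merge table mimicking two lines of a merges.txt file:
    //   l o
    //   lo w
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);
    println!("{:?}", bpe_word("lower", &ranks)); // ["low", "e", "r"]
}
```

The resulting symbols would then be mapped to ids through vocabulary.json, which is exactly the pair of inputs the proposed method would take.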
Who will benefit from this enhancement?
Anyone using a byte-pair encoding tokenizer.
References
- https://huggingface.co/docs/transformers/main/en/tokenizer_summary#byte-pair-encoding
- https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/bpe/model.rs
@larochef Yes, we should be able to expose a BPE API in our Hugging Face tokenizer extension.
In the meantime, you should be able to use our SentencePiece extension, which has a BPE implementation.
I've had a look at SentencePiece, but it seems I won't be able to reuse it as-is, since the training data appear to differ somewhat.
I've also had a look at the Rust binding, and it feels like it shouldn't be too hard to add this support. If it's OK, I'll gladly propose something for it, mirroring the way they do it with a builder.
I have pushed an MR for it; a few points:
- I didn't manage to get the builder to work: I ran into memory move issues that I didn't know how to fix properly. I suspect it is linked to what is done in the cast_handle method, which doesn't play nicely with the builder.
- For some reason, I can't get the JNI linking to work in unit tests; there must be something I'm missing, but I don't see what.
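For the first point, here is a hypothetical sketch (not the actual DJL JNI code; `BpeBuilder`, `cast_handle`, and `builder_set_dropout` are stand-in names) of why a consuming builder clashes with a cast_handle-style helper. JNI stores native objects as raw-pointer handles, and a helper that hands back a `&mut T` cannot be used with builder methods that take `self` by value, because you cannot move out of a mutable reference. One workaround is to reclaim ownership with `Box::from_raw`, apply the consuming method, and return a fresh handle:

```rust
struct BpeBuilder {
    dropout: Option<f32>,
}

impl BpeBuilder {
    fn new() -> Self {
        BpeBuilder { dropout: None }
    }
    // Consuming builder method, in the style of the tokenizers crate.
    fn dropout(mut self, p: f32) -> Self {
        self.dropout = Some(p);
        self
    }
}

// A cast_handle-style helper: turns a handle back into a &mut reference.
// Calling `builder.dropout(0.1)` through this reference does not compile
// ("cannot move out of `*builder`"), which matches the move errors above.
#[allow(dead_code)]
unsafe fn cast_handle<T>(handle: i64) -> &'static mut T {
    &mut *(handle as *mut T)
}

// Workaround: take ownership back from the raw pointer, apply the
// consuming builder method, then leak the updated builder as a new handle.
unsafe fn builder_set_dropout(handle: i64, p: f32) -> i64 {
    let builder = Box::from_raw(handle as *mut BpeBuilder);
    let builder = builder.dropout(p);
    Box::into_raw(Box::new(builder)) as i64
}

fn main() {
    // Simulate the JNI lifecycle: create a handle, mutate through it,
    // then reclaim it (in real code the Java side frees it eventually).
    let handle = Box::into_raw(Box::new(BpeBuilder::new())) as i64;
    let handle = unsafe { builder_set_dropout(handle, 0.1) };
    let builder = unsafe { Box::from_raw(handle as *mut BpeBuilder) };
    assert_eq!(builder.dropout, Some(0.1));
}
```

The cost is that every consuming builder call invalidates the old handle and returns a new one, so the Java wrapper has to update its stored pointer after each call; an alternative is to keep the builder's fields in a plain struct and only construct the real builder once, at build time.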
Tell me what you think of it, and let's see how we can get it fully working!