djl
tokenizer.json compatibility with JNI Rust tokenizers - data did not match any variant of untagged enum
Description
The upstream Python/Rust tokenizer save_pretrained function adds a new key, model.byte_fallback, at the model level of the tokenizer.json configuration. This causes an exception when calling the native createTokenizerFromString. Could this be related to using an older version of the Rust tokenizers library?
Expected Behavior
Able to load the tokenizer from a tokenizer.json file.
Error Message
Caused by: java.lang.RuntimeException:
data did not match any variant of untagged enum PreTokenizerWrapper at line 73 column 3
at ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.createTokenizerFromString(Native Method)
How to Reproduce?
- Install a recent version of the transformers library
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-small')
tokenizer.save_pretrained("saved")
- Attempt to load the saved tokenizer.json file with DJL 0.27.0 using HuggingFaceTokenizer.newInstance
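The description points at model.byte_fallback, while the exception names PreTokenizerWrapper, so it may help to look at both sections of the saved file before loading it. The following is a minimal diagnostic sketch using only the Python standard library. The helper name summarize_tokenizer_config is hypothetical, and the sketch assumes the file has the usual top-level "model" and "pre_tokenizer" sections. It only reports what is in the file and does not fix the incompatibility.

```python
import json
from pathlib import Path


def summarize_tokenizer_config(path):
    """Return the parts of a tokenizer.json that are relevant to this issue.

    Hypothetical helper for illustration; it is not part of transformers or DJL.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    # Either section may be missing or null, so fall back to an empty dict.
    model = config.get("model") or {}
    pre_tokenizer = config.get("pre_tokenizer") or {}
    return {
        # True when the key exists at all, regardless of its value.
        "has_byte_fallback": "byte_fallback" in model,
        "model_type": model.get("type"),
        "pre_tokenizer_type": pre_tokenizer.get("type"),
        # Full section, for comparison with the location given in the error.
        "pre_tokenizer": pre_tokenizer,
    }


if __name__ == "__main__":
    # Assumes the directory written by tokenizer.save_pretrained("saved").
    target = Path("saved") / "tokenizer.json"
    if target.exists():
        print(json.dumps(summarize_tokenizer_config(target), indent=2))
    else:
        print(f"{target} not found; run save_pretrained first")
```

Running the same helper on the tokenizer.json published with the model and comparing the two summaries should show which section changed when the file was re-saved.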
@jobergum
I confirmed your issue. We will try to upgrade to 0.19.1. In the meantime, you can use:
HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("intfloat/multilingual-e5-small");
Thank you for the swift reply! Yes, using pre-existing tokenizer files works great, but if people make any kind of change and save the tokenizer file, it breaks.