Tokenizer.json compatibility with JNI Rust tokenizers - data did not match any variant of untagged enum

Open jobergum opened this issue 1 year ago • 2 comments

Description

The upstream Python/Rust transformers tokenizer save_pretrained function adds a new key, model.byte_fallback, at the model level of the tokenizer.json configuration, which causes an exception when calling the native createTokenizerFromString. Could this be related to using an older Rust version of the tokenizers library?
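To check which of these fields a saved file actually contains, a minimal sketch along the following lines may help. It uses only the Python standard library. The helper name describe_tokenizer_config is hypothetical, and the path saved/tokenizer.json is an assumption based on the directory name used in the reproduction steps. The pre_tokenizer section is included because the error message names PreTokenizerWrapper.

```python
import json
from pathlib import Path


def describe_tokenizer_config(config: dict) -> dict:
    """Summarize the fields of a parsed tokenizer.json relevant to this issue.

    Hypothetical helper for illustration; not part of DJL or transformers.
    """
    # Either section may be missing or null in the file, so fall back to {}.
    model = config.get("model") or {}
    pre_tokenizer = config.get("pre_tokenizer") or {}
    return {
        "model_type": model.get("type"),
        "has_byte_fallback": "byte_fallback" in model,
        "pre_tokenizer_type": pre_tokenizer.get("type"),
        "pre_tokenizer_keys": sorted(pre_tokenizer),
    }


if __name__ == "__main__":
    # Assumed location: the directory passed to save_pretrained.
    path = Path("saved/tokenizer.json")
    if path.exists():
        config = json.loads(path.read_text(encoding="utf-8"))
        print(describe_tokenizer_config(config))
```

Comparing this summary for the saved file against the same summary for a tokenizer.json that loads correctly should narrow down which fields differ.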

Expected Behavior

Able to load the tokenizer from a tokenizer.json file.

Error Message

Caused by: java.lang.RuntimeException: 
data did not match any variant of untagged enum PreTokenizerWrapper at line 73 column 3
	at ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.createTokenizerFromString(Native Method)
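The message reports a position (line 73, column 3) in the JSON text. A small sketch like the one below can print the surrounding lines of the file to see which section the parser rejected. It assumes the reported line number is 1-based and that the string passed to createTokenizerFromString is identical to the file on disk. The helper name context_around and the path saved/tokenizer.json are hypothetical.

```python
from pathlib import Path


def context_around(text: str, line: int, radius: int = 2) -> list[str]:
    """Return the lines of `text` near a 1-based line number, each prefixed with its number."""
    lines = text.splitlines()
    # Clamp the window to the bounds of the text.
    start = max(line - radius, 1)
    end = min(line + radius, len(lines))
    return [f"{n}: {lines[n - 1]}" for n in range(start, end + 1)]


if __name__ == "__main__":
    # Assumed location of the saved file; 73 is the line from the error message.
    path = Path("saved/tokenizer.json")
    if path.exists():
        for entry in context_around(path.read_text(encoding="utf-8"), 73):
            print(entry)
```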

How to Reproduce?

  1. Install a recent version of the transformers library, then run:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-small')
tokenizer.save_pretrained("saved")

  2. Attempt to load the saved tokenizer.json file with 0.27.0 using HuggingFaceTokenizer.newInstance.

jobergum, Apr 30 '24 07:04

@jobergum

I confirmed your issue. I will try to upgrade to 0.19.1. In the meantime, you can use:

HuggingFaceTokenizer tokenizer = HuggingFaceTokenizer.newInstance("intfloat/multilingual-e5-small");

frankfliu, Apr 30 '24 15:04

Thank you for the swift reply! Yes, using pre-existing tokenizer files works great, but if people make any kind of change and save the tokenizer file, it breaks.

jobergum, Apr 30 '24 16:04