
Install `fugashi`, `unidic`, `unidic-lite`, and `ipadic` as dependencies to MLServer HuggingFace to support hosting Japanese language models

Open jbauer2718 opened this issue 1 year ago • 3 comments

Because Japanese mixes phonetic scripts with Chinese characters, tokenizers for these models need special algorithms and dictionaries. A popular example is the BERT Japanese model:

https://huggingface.co/transformers/v4.11.3/_modules/transformers/models/bert_japanese/tokenization_bert_japanese.html

Without these dependencies, mlserver_huggingface/common.py errors when trying to load the tokenizer in the pipeline.
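One way to confirm whether the serving environment has these packages is a small standalone check like the one below. This is only a sketch: the helper name `missing_japanese_tokenizer_deps` is hypothetical and not part of MLServer, and the import names (in particular `unidic_lite` for the `unidic-lite` package) are assumptions based on how those packages are normally imported.

```python
from importlib.util import find_spec

# pip distribution name -> importable module name.
# Assumption: unidic-lite is imported as unidic_lite; the others
# use the same name for the package and the module.
JAPANESE_TOKENIZER_DEPS = {
    "fugashi": "fugashi",
    "unidic": "unidic",
    "unidic-lite": "unidic_lite",
    "ipadic": "ipadic",
}


def missing_japanese_tokenizer_deps(deps=None):
    """Return the pip names of dependencies that cannot be imported.

    Hypothetical helper for illustration only; not an MLServer API.
    """
    deps = JAPANESE_TOKENIZER_DEPS if deps is None else deps
    # find_spec returns None when a top-level module is not installed.
    return [
        pip_name
        for pip_name, module_name in deps.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_japanese_tokenizer_deps()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All Japanese tokenizer dependencies are importable.")
```

If the list is non-empty, loading a Japanese tokenizer in the same environment will likely fail in the way described above.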

To reproduce, use any Japanese model. Here is an example.

jbauer2718 avatar Dec 08 '23 18:12 jbauer2718

If someone adds me as a contributor, I am happy to fix this issue and write a test for it.

jbauer2718 avatar Dec 08 '23 19:12 jbauer2718

@jbauer2718 many thanks for reporting this issue and offering to fix it. You can create a PR based on changes from your fork and we can look at it.

sakoush avatar Dec 11 '23 08:12 sakoush

Hey @sakoush, I just added the above-linked PR for the team's review.

jbauer2718 avatar Dec 12 '23 19:12 jbauer2718