
tokenizer.json required for TEI?

Open nbroad1881 opened this issue 1 year ago • 6 comments

System Info

When I try to use the model ibm/re2g-reranker-trex in TEI, it errors because the repo has no tokenizer.json file. If I call AutoTokenizer.from_pretrained("ibm/re2g-reranker-trex"), the tokenizer is created without any issues.

I have opened a pull request on the model page to include the tokenizer.json file, but I'm wondering if something should/could be done on the TEI side.

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

model=ibm/re2g-reranker-trex 
volume=$PWD/data 

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model

Expected behavior

No error

nbroad1881 avatar Feb 23 '24 17:02 nbroad1881

@OlivierDehaene ,

This may be an issue with older models on the Hub, for both the tokenizer and config.json.

Older BERT models won't have a tokenizer.json file.

SequenceClassification models won't have num_labels, id2label, or label2id in config.json.

Should TEI be able to handle these cases, or is it up to the user to create a PR to include these new files?
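For context, the label metadata in question looks like this in config.json (the two-label layout below is illustrative, not taken from any specific model):

```json
{
  "architectures": ["BertForSequenceClassification"],
  "num_labels": 2,
  "id2label": { "0": "LABEL_0", "1": "LABEL_1" },
  "label2id": { "LABEL_0": 0, "LABEL_1": 1 }
}
```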

nbroad1881 avatar Feb 29 '24 23:02 nbroad1881

> SequenceClassification models won't have num_labels, id2label, or label2id in config.json

Do you have an example?

> Should TEI be able to handle these cases, or is it up to the user to create a PR to include these new files?

For tokenizer.json, TEI will not be able to replace it. For the other case I'm not sure and would like to explore the examples to figure it out.

OlivierDehaene avatar Mar 01 '24 16:03 OlivierDehaene

I had to make a pull request on this model to get it working with TEI: https://huggingface.co/ibm/re2g-reranker-nq

nbroad1881 avatar Mar 01 '24 17:03 nbroad1881

On second glance, this might be an anomaly. Other older models seem fine:

https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment/blob/main/config.json
https://huggingface.co/jb2k/bert-base-multilingual-cased-language-detection/blob/main/config.json

nbroad1881 avatar Mar 01 '24 17:03 nbroad1881

I have the same issue with old models that don't ship a tokenizer.json. Is there any workaround for us to get a tokenizer.json? As far as I know, this file comes from the fast tokenizer class: https://huggingface.co/docs/transformers/en/fast_tokenizers
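One possible workaround (a sketch, not an official TEI feature) is to let transformers convert the slow tokenizer to a fast one and save it, which writes the tokenizer.json file; the resulting directory can then be mounted into the TEI container or pushed to the Hub via a PR. The output path "converted" is a placeholder:

```python
# Sketch: produce a tokenizer.json for a repo that only ships vocab.txt.
# AutoTokenizer converts slow tokenizers to fast ones where possible, and
# saving a fast tokenizer writes a tokenizer.json. Requires `transformers`.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ibm/re2g-reranker-trex")
assert tok.is_fast, "no fast tokenizer available for this model"
tok.save_pretrained("converted")  # writes converted/tokenizer.json
```

Note this only works for architectures where a slow-to-fast conversion exists in transformers, which is the case for BERT-style WordPiece tokenizers.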

AndrewNgo-ini avatar Mar 07 '24 14:03 AndrewNgo-ini

> SequenceClassification models won't have num_labels, id2label, or label2id in config.json

> Do you have an example?

@OlivierDehaene How about this one: https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco/blob/main/config.json

Attempting to run this model with TEI yields the following error:

Error: `config.json` does not contain `id2label`

For reference, this is the command I ran:

model=amberoad/bert-multilingual-passage-reranking-msmarco
volume=$PWD/models

docker run -p 8088:80 -v $volume:/models --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.1.0 --model-id $model
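A possible stopgap (a sketch; the generic LABEL_i names are assumptions, not taken from the model card) is to patch the missing label fields into the downloaded config.json before starting TEI:

```python
# Sketch: add the label metadata TEI expects to a config.json that lacks it.
# The stand-in dict below mimics a downloaded config; in practice you would
# json.load the real file from the mounted model directory instead.
import json

config = {"architectures": ["BertForSequenceClassification"], "num_labels": 2}

# Generic label names are an assumption; match them to the classifier head.
config.setdefault("id2label", {str(i): f"LABEL_{i}" for i in range(config["num_labels"])})
config.setdefault("label2id", {v: int(k) for k, v in config["id2label"].items()})

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```

Since the patched file lives in the mounted volume, TEI should pick it up on the next container start without re-downloading the model.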

w3iw3i avatar Mar 13 '24 02:03 w3iw3i