text-embeddings-inference
tokenizer.json required for TEI?
System Info
When trying to use the model ibm/re2g-reranker-trex in TEI, it errors because the repository has no tokenizer.json file. If I call AutoTokenizer.from_pretrained("ibm/re2g-reranker-trex"), the tokenizer is created without any issues.
I have opened a pull request on the model page to include the tokenizer.json file, but I'm wondering if something should/could be done on the TEI side.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
```shell
model=ibm/re2g-reranker-trex
volume=$PWD/data
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model
```
Expected behavior
No error
@OlivierDehaene,
This may be an issue with older models on the hub both for the tokenizer and the config.json.
Older BERT models won't have a `tokenizer.json` file.
SequenceClassification models won't have `num_labels`, `id2label`, or `label2id` in `config.json`.
Should TEI be able to handle these cases, or is it up to the user to create a PR to include these new files?
> SequenceClassification models won't have `num_labels`, `id2label`, or `label2id` in `config.json`
Do you have an example?
> Should TEI be able to handle these cases, or is it up to the user to create a PR to include these new files?
For `tokenizer.json`, TEI will not be able to replace it. For the other case, I'm not sure and would like to explore the examples to figure it out.
I had to make a pull request on this model to get it working with TEI: https://huggingface.co/ibm/re2g-reranker-nq
On second glance, this might be an anomaly. Other older models seem fine:
https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment/blob/main/config.json
https://huggingface.co/jb2k/bert-base-multilingual-cased-language-detection/blob/main/config.json
I have the same issue with old models that don't have a tokenizer.json. Is there any workaround for us to generate a tokenizer.json? As far as I know, this file comes from the fast tokenizer classes: https://huggingface.co/docs/transformers/en/fast_tokenizers
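One possible workaround (a sketch, not an official TEI feature; requires `transformers` and network access, and the output directory name is arbitrary): load the model's slow tokenizer with transformers, which converts it to a fast tokenizer, then save it so a `tokenizer.json` is written out. The resulting directory can be uploaded to the hub or mounted locally for TEI.

```python
# Sketch: convert a legacy (slow) tokenizer to a fast one and emit tokenizer.json.
from transformers import AutoTokenizer

# use_fast=True (the default) converts the legacy vocab files to a fast tokenizer
tokenizer = AutoTokenizer.from_pretrained("ibm/re2g-reranker-trex")

# save_pretrained writes tokenizer.json alongside the other tokenizer files
tokenizer.save_pretrained("re2g-reranker-trex-local")
```

You can then point `--model-id` at the local directory (or open a PR on the model repo with the generated file, as was done above).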
> SequenceClassification models won't have `num_labels`, `id2label`, or `label2id` in `config.json`
Do you have an example?
@OlivierDehaene How about this : https://huggingface.co/amberoad/bert-multilingual-passage-reranking-msmarco/blob/main/config.json
Attempting to run this model with TEI yields the following error:
Error: `config.json` does not contain `id2label`
For reference, below is the command that I ran:
```shell
model=amberoad/bert-multilingual-passage-reranking-msmarco
volume=$PWD/models
docker run -p 8088:80 -v $volume:/models --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-1.1.0 --model-id $model
```
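A possible local fix for the missing-`id2label` error (a sketch; the label names below are placeholders I chose for illustration, not values from the model card) is to patch the model's `config.json` with the fields TEI expects before pointing `--model-id` at the local copy:

```python
import json

# Hypothetical minimal config.json for a 2-label reranker; in practice,
# read the model's real config.json from the downloaded snapshot.
cfg = {
    "architectures": ["BertForSequenceClassification"],
    "num_labels": 2,
}

# Add the label maps TEI's error message asks for (placeholder names)
cfg["id2label"] = {"0": "LABEL_0", "1": "LABEL_1"}
cfg["label2id"] = {"LABEL_0": 0, "LABEL_1": 1}

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```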