Get null embedding
System Info
infinity 0.0.53
OS version: linux
Model being used: dunzhang/stella_en_1.5B_v5
Hardware used: NVIDIA A100
Information
- [ ] Docker
- [X] The CLI directly via pip
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
I run the following command:

`infinity_emb v2 --model-id dunzhang/stella_en_1.5B_v5 --port 3002 --trust-remote-code --served-model-name embedding`

Then I call the /embeddings API:

`{ "input": ["5.2"], "model": "embedding" }`

I get a list of null values for the embedding. I tried some other models and they do not return null values like this model does.
Expected behavior
This should return a list of float values.
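For reference, the same request from Python — a minimal sketch, assuming the server started above is listening on localhost:3002 and `requests` is installed:

```python
# Reproduction sketch: POST the same payload and count null entries.
# Assumes the infinity server above is running on localhost:3002.
import requests

resp = requests.post(
    "http://localhost:3002/embeddings",
    json={"input": ["5.2"], "model": "embedding"},
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]

# On the broken setup, every entry comes back as null (None in Python).
print("entries:", len(embedding))
print("nulls:", sum(v is None for v in embedding))
```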
Can you post the full logs?
Here is my full log:
INFO 2024-08-01 01:49:48,998 datasets INFO: PyTorch version 2.3.1 available. config.py:58
INFO: Started server process [116201]
INFO: Waiting for application startup.
INFO 2024-08-01 01:49:50,366 infinity_emb INFO: model=/cloudata/thainq/models/models/stella_en_1.5B_v5/ selected, using engine=torch and device=None select_model.py:57
INFO 2024-08-01 01:49:50,369 sentence_transformers.SentenceTransformer INFO: Use pytorch device_name: cuda SentenceTransformer.py:189
INFO 2024-08-01 01:49:50,369 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: /cloudata/thainq/models/models/stella_en_1.5B_v5/ SentenceTransformer.py:197
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 2024-08-01 01:49:56,722 sentence_transformers.SentenceTransformer INFO: 2 prompts are loaded, with the keys: ['s2p_query', 's2s_query'] SentenceTransformer.py:326
INFO 2024-08-01 01:49:56,925 infinity_emb INFO: Adding optimizations via Huggingface optimum. acceleration.py:46
WARNING 2024-08-01 01:49:56,927 infinity_emb WARNING: BetterTransformer is not available for model: <class 'transformers_modules.modeling_qwen.Qwen2Model'> Continue without bettertransformer modeling code. acceleration.py:57
INFO 2024-08-01 01:49:56,928 infinity_emb INFO: Switching to half() precision (cuda: fp16). sentence_transformer.py:81
INFO 2024-08-01 01:49:57,812 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=2 select_model.py:80
7.07 ms tokenization
27.29 ms inference
0.13 ms post-processing
34.49 ms total
embeddings/sec: 927.83
INFO 2024-08-01 01:49:58,583 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=512 select_model.py:86
16.83 ms tokenization
338.24 ms inference
0.39 ms post-processing
355.46 ms total
embeddings/sec: 90.03
INFO 2024-08-01 01:49:58,586 infinity_emb INFO: model warmed up, between 90.03-927.83 embeddings/sec at batch_size=32 select_model.py:87
INFO 2024-08-01 01:49:58,588 infinity_emb INFO: creating batching engine batch_handler.py:321
INFO 2024-08-01 01:49:58,589 infinity_emb INFO: ready to batch requests. batch_handler.py:384
INFO 2024-08-01 01:49:58,593 infinity_emb INFO: infinity_server.py:63
♾️ Infinity - Embedding Inference Server
MIT License; Copyright (c) 2023 Michael Feil
Version 0.0.53
Open the Docs via Swagger UI:
http://0.0.0.0:3002/docs
Access model via 'GET':
curl http://0.0.0.0:3002/models
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:3002 (Press CTRL+C to quit)
INFO: 172.16.250.26:50758 - "POST /embeddings HTTP/1.1" 200 OK
To use this package, you need to ensure that your model is compatible with https://github.com/UKPLab/sentence-transformers. Is it compatible?
You are using cloudata/thainq/models/models, a local model. This is not officially supported; use at your own risk. The reason is that you would need all the required files present (config.json, model.safetensors, the sentence-transformers config, and the sentence-transformers config layers). This might be the source of the bug.
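If you want to sanity-check the local directory, something along these lines works — a sketch where the file list is illustrative, not exhaustive (exact contents vary per model, and large models may shard model.safetensors):

```python
# Sketch: check a local sentence-transformers model directory for the usual
# config files (list is illustrative; exact contents vary per model).
from pathlib import Path

model_dir = Path("/cloudata/thainq/models/models/stella_en_1.5B_v5")
expected = [
    "config.json",
    "config_sentence_transformers.json",
    "modules.json",
    "tokenizer_config.json",
]
for name in expected:
    print(name, "->", "found" if (model_dir / name).exists() else "MISSING")
```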
Yes, I also ran this model with sentence-transformers and it works well.
Did you try it with the docker container?
I haven't tried Docker yet, but I noticed that every Qwen2-based model returns a list of null embeddings with my input above (gte-qwen2-1.5, gte-qwen2-7). Is this a problem?
Also seeing this issue with Qwen-based models; https://huggingface.co/dunzhang/stella_en_400M_v5 and the 1.5B variant both have this problem.
Can confirm that we've also had this issue with dunzhang/stella_en_1.5B_v5, running in Docker.
Problem solved. Run the model with float32.
@brannt @ptkenny @MLlove0402 This is solved. In float16 the model produces torch.nan values; whether this happens depends on the input length, etc.
Models like dunzhang/stella_en_1.5B_v5 are best used with --dtype float32 or --dtype bfloat16 --device cuda.
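To see the NaNs directly, here is a diagnostic sketch with sentence-transformers (assumes a CUDA GPU; loading the 1.5B model three times takes a while):

```python
# Diagnostic sketch: encode the failing input at different precisions and
# check for NaNs. fp16 may produce NaNs depending on the input;
# fp32 and bf16 should not.
import torch
from sentence_transformers import SentenceTransformer

for dtype in (torch.float16, torch.bfloat16, torch.float32):
    model = SentenceTransformer(
        "dunzhang/stella_en_1.5B_v5", device="cuda", trust_remote_code=True
    ).to(dtype)
    emb = model.encode(["5.2"], convert_to_tensor=True)
    print(dtype, "-> NaNs:", torch.isnan(emb).any().item())
```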
Hi @michaelfeil, thank you so much for your responses! Your response on another issue helped me figure out that I needed to use the fa image, and now the float32 dtype.
Qwen models are pretty quirky; I wonder why.
@TheOnlyWayUp I am soon releasing a new CI that automatically includes the flash-attention and onnx-gpu packages.
The Qwen models should work best with bfloat16. bfloat16 is as fast as float16, but keeps the dynamic range of float32, so it avoids the float16 overflows that turn into NaNs.
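A quick torch illustration of the range difference:

```python
# Range comparison: float16 overflows above ~65504, while bfloat16 keeps
# (roughly) float32's exponent range at reduced mantissa precision.
import torch

x = torch.tensor([70000.0])             # exceeds float16's max finite value
print(x.to(torch.float16))              # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))             # tensor([70144.], dtype=torch.bfloat16)
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38, same order as float32
```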