
Get null embedding

Open MLlove0402 opened this issue 1 year ago • 9 comments

System Info

infinity: 0.0.53
OS version: Linux
Model being used: dunzhang/stella_en_1.5B_v5
Hardware used: NVIDIA A100

Information

  • [ ] Docker
  • [X] The CLI directly via pip

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

I ran the following command:

    infinity_emb v2 --model-id dunzhang/stella_en_1.5B_v5 --port 3002 --trust-remote-code --served-model-name embedding

Then I called the /embeddings API with:

    { "input": ["5.2"], "model": "embedding" }

The response contained a list of null values for the embedding. I tried some other models, and they do not return null values like this model does.

Expected behavior

The API should return a list of float values.
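To confirm the failure mode, a response like the one above can be checked for null entries. This is a minimal sketch assuming an OpenAI-style /embeddings response body; the JSON literal here is a stand-in for a real server reply:

```python
import json

# Stand-in for a response body from the /embeddings endpoint;
# JSON null decodes to Python None.
body = '{"data": [{"embedding": [null, null, null]}], "model": "embedding"}'
resp = json.loads(body)

embedding = resp["data"][0]["embedding"]
has_null = any(v is None for v in embedding)  # True for the broken response
```

A healthy response would contain only floats, so `has_null` should be False.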

MLlove0402 avatar Jul 31 '24 15:07 MLlove0402

Can you post the full logs?

michaelfeil avatar Jul 31 '24 15:07 michaelfeil

Can you post the full logs?

Here is my full log:

    INFO 2024-08-01 01:49:48,998 datasets INFO: PyTorch version 2.3.1 available. config.py:58
    INFO: Started server process [116201]
    INFO: Waiting for application startup.
    INFO 2024-08-01 01:49:50,366 infinity_emb INFO: model=/cloudata/thainq/models/models/stella_en_1.5B_v5/ selected, using engine=torch and device=None select_model.py:57
    INFO 2024-08-01 01:49:50,369 sentence_transformers.SentenceTransformer INFO: Use pytorch device_name: cuda SentenceTransformer.py:189
    INFO 2024-08-01 01:49:50,369 sentence_transformers.SentenceTransformer INFO: Load pretrained SentenceTransformer: /cloudata/thainq/models/models/stella_en_1.5B_v5/ SentenceTransformer.py:197
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    INFO 2024-08-01 01:49:56,722 sentence_transformers.SentenceTransformer INFO: 2 prompts are loaded, with the keys: ['s2p_query', 's2s_query'] SentenceTransformer.py:326
    INFO 2024-08-01 01:49:56,925 infinity_emb INFO: Adding optimizations via Huggingface optimum. acceleration.py:46
    WARNING 2024-08-01 01:49:56,927 infinity_emb WARNING: BetterTransformer is not available for model: <class 'transformers_modules.modeling_qwen.Qwen2Model'> Continue without bettertransformer modeling code. acceleration.py:57
    INFO 2024-08-01 01:49:56,928 infinity_emb INFO: Switching to half() precision (cuda: fp16). sentence_transformer.py:81
    INFO 2024-08-01 01:49:57,812 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=2 select_model.py:80
        7.07 ms tokenization
        27.29 ms inference
        0.13 ms post-processing
        34.49 ms total
        embeddings/sec: 927.83
    INFO 2024-08-01 01:49:58,583 infinity_emb INFO: Getting timings for batch_size=32 and avg tokens per sentence=512 select_model.py:86
        16.83 ms tokenization
        338.24 ms inference
        0.39 ms post-processing
        355.46 ms total
        embeddings/sec: 90.03
    INFO 2024-08-01 01:49:58,586 infinity_emb INFO: model warmed up, between 90.03-927.83 embeddings/sec at batch_size=32 select_model.py:87
    INFO 2024-08-01 01:49:58,588 infinity_emb INFO: creating batching engine batch_handler.py:321
    INFO 2024-08-01 01:49:58,589 infinity_emb INFO: ready to batch requests. batch_handler.py:384
    INFO 2024-08-01 01:49:58,593 infinity_emb INFO: infinity_server.py:63

     ♾️  Infinity - Embedding Inference Server                                                                                                                                                                                           
     MIT License; Copyright (c) 2023 Michael Feil
     Version 0.0.53

     Open the Docs via Swagger UI:
     http://0.0.0.0:3002/docs

     Access model via 'GET':
     curl http://0.0.0.0:3002/models

    INFO: Application startup complete.
    INFO: Uvicorn running on http://0.0.0.0:3002 (Press CTRL+C to quit)
    INFO: 172.16.250.26:50758 - "POST /embeddings HTTP/1.1" 200 OK

MLlove0402 avatar Aug 01 '24 01:08 MLlove0402

To use this package, you need to ensure that your model is compatible with https://github.com/UKPLab/sentence-transformers. Is it compatible?

You are using cloudata/thainq/models/models, a local model. This is not officially supported; do so at your own risk. The reason is that you would need all files in place (config.json, model.safetensors, the sentence-transformers config, and the sentence-transformers config layers). This might be the source of the bug.
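As a sanity check for a local model directory, something like the following can verify that the expected files are present before starting the server. The exact file list below is an assumption based on a typical sentence-transformers checkout, not an official infinity requirement; adjust it for your model (sharded safetensors, tokenizer files, etc.):

```python
from pathlib import Path

# Assumed file set for a sentence-transformers model checkout (hypothetical;
# verify against the files your specific model ships on the Hub).
EXPECTED = [
    "config.json",
    "model.safetensors",
    "config_sentence_transformers.json",
    "modules.json",
]

def missing_files(model_dir: str) -> list:
    """Return the expected files that are absent from model_dir."""
    d = Path(model_dir)
    return [name for name in EXPECTED if not (d / name).exists()]
```

Running `missing_files("/cloudata/thainq/models/models/stella_en_1.5B_v5/")` before launching infinity_emb would surface an incomplete local copy early.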

michaelfeil avatar Aug 01 '24 02:08 michaelfeil

To use this package, you need to ensure that your model is compatible with https://github.com/UKPLab/sentence-transformers. Is it compatible?

Yes, I have also run this model with sentence-transformers and it works well.

MLlove0402 avatar Aug 01 '24 02:08 MLlove0402

Did you try it with the docker container?

michaelfeil avatar Aug 01 '24 03:08 michaelfeil

Did you try it with the docker container?

I haven't tried Docker yet, but I notice that every Qwen2-based model returns a list of null embeddings for my input above (gte-qwen2-1.5, gte-qwen2-7). Is that the problem?

MLlove0402 avatar Aug 01 '24 08:08 MLlove0402

Also seeing this issue with Qwen based models, https://huggingface.co/dunzhang/stella_en_400M_v5 and the 1.5B variant both have this problem.

ptkenny avatar Sep 16 '24 12:09 ptkenny

Can confirm that we've also had this issue with dunzhang/stella_en_1.5B_v5, running in Docker.

brannt avatar Sep 16 '24 12:09 brannt

Problem solved. Run the model with float32.

MLlove0402 avatar Sep 16 '24 13:09 MLlove0402

@brannt @ptkenny @MLlove0402 This is solved. The model produces torch.nan values in fp16; whether this happens depends on the input length, etc.

Models like dunzhang/stella_en_1.5B_v5 are best used with --dtype float32 or --dtype bfloat16 --device cuda.
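The fp16 failure can be reproduced without the model: float16's largest finite value is 65504, so intermediate activations above that overflow to inf, and downstream arithmetic (e.g. normalization) turns inf into nan. A small NumPy sketch of the mechanism (an illustration, not the actual model code):

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    # float16 saturates above 65504 (its largest finite value)
    x = np.float16(60000.0)
    y = np.float16(x * np.float16(2.0))  # 120000 overflows to +inf
    z = y - y                            # inf - inf is nan
```

In float32 (or bfloat16, which shares float32's exponent range) the same product stays finite, which is why switching the dtype fixes the null embeddings.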

michaelfeil avatar Oct 10 '24 08:10 michaelfeil

Hi @michaelfeil, thank you so much for your responses! Your response on another issue helped me figure out that I needed to use the fa image, and now the float32 dtype.

Qwen models are pretty quirky; I wonder why.

TheOnlyWayUp avatar Oct 17 '24 14:10 TheOnlyWayUp

@TheOnlyWayUp I will soon release a new CI that automatically includes the flash-attention and onnx-gpu packages.

The Qwen models should work best with bfloat16. bfloat16 is as fast as float16 but has the same dynamic range as float32, which avoids the fp16 overflow to NaN.
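The range/precision trade-off is easy to see without a GPU: bfloat16 keeps float32's 8-bit exponent but only 8 mantissa bits. A pure-Python sketch that emulates bfloat16 by truncating float32 bits (real hardware rounds to nearest rather than truncating):

```python
import struct

def truncate_to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by zeroing the low 16 bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# 120000.0 overflows float16 (max finite value 65504)
# but remains finite in bfloat16...
big = truncate_to_bfloat16(120000.0)

# ...at the cost of precision: only ~8 mantissa bits survive,
# so 1.001 collapses to 1.0
small = truncate_to_bfloat16(1.001)
```

This is why bfloat16 avoids the NaN overflow that float16 hits, while staying the same size (and roughly the same speed) as float16.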

michaelfeil avatar Oct 19 '24 07:10 michaelfeil