[Feature Request]: Introduce bce text embedding model as an alternative

Open yingfeng opened this issue 4 months ago • 3 comments

Is there an existing issue for the same feature request?

  • [X] I have checked the existing issues.

Is your feature request related to a problem?

No response

Describe the feature you'd like

bce is an excellent embedding model compared with the bge model currently in use, so we should add it as an alternative.

Describe implementation you've considered

No response

Documentation, adoption, use case

No response

Additional information

No response

yingfeng avatar Apr 11 '24 15:04 yingfeng

from BCEmbedding import EmbeddingModel

# list of sentences
sentences = ['sentence_0', 'sentence_1']

# init embedding model
model = EmbeddingModel(model_name_or_path="maidalun1020/bce-embedding-base_v1")

# extract embeddings
embeddings = model.encode(sentences)

Umpire2018 avatar Apr 12 '24 02:04 Umpire2018

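import os
import torch
from FlagEmbedding import FlagModel

# get_project_base_directory is ragflow's own helper; its import is omitted here.
# The instruction string means: "Generate a representation for this sentence
# for retrieving related articles:"
try:
    # Prefer the copy bundled under rag/res.
    flag_model = FlagModel(
        os.path.join(get_project_base_directory(), "rag/res/bge-large-zh-v1.5"),
        query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
        use_fp16=torch.cuda.is_available())
except Exception:
    # Fall back to resolving the model by its Hugging Face hub ID.
    flag_model = FlagModel(
        "BAAI/bge-large-zh-v1.5",
        query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
        use_fp16=torch.cuda.is_available())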

Maybe we can set bce as the default Hugging Face model, since it supports both Chinese and English?

bge-large-zh-v1.5 only supports Chinese.
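
One way to make the default switchable is to keep the local-then-hub fallback from the snippet above but take the model ID and the loader as parameters. This is only a sketch: `load_embedding_model`, `DEFAULT_HF_MODEL`, and the injected `loader` callable are hypothetical names, not ragflow code, and the local directory layout is an assumption modeled on the `rag/res/...` path above.

```python
import os
from typing import Any, Callable

# Hypothetical default; the thread proposes bce because it covers Chinese and English.
DEFAULT_HF_MODEL = "maidalun1020/bce-embedding-base_v1"


def load_embedding_model(
    loader: Callable[[str], Any],
    base_dir: str,
    model_id: str = DEFAULT_HF_MODEL,
) -> Any:
    """Try a local copy under base_dir first, then fall back to the hub ID.

    `loader` is any callable that builds a model from a path or hub ID,
    e.g. `lambda p: EmbeddingModel(model_name_or_path=p)` for BCEmbedding.
    """
    # Assumed layout: <base_dir>/<last path component of the model ID>
    local_path = os.path.join(base_dir, model_id.split("/")[-1])
    try:
        return loader(local_path)
    except Exception:
        # Local copy missing or unreadable: let the loader resolve the hub ID.
        return loader(model_id)
```

Passing the loader in keeps the fallback logic independent of whether the model is built with `FlagModel` or `EmbeddingModel`.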

  1. Currently the project uses text-embedding-ada-002 for OpenAI embeddings. Maybe it's time to switch to the newer text-embedding-3-small? Reference here.
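
If the OpenAI model name were a parameter rather than a hard-coded string, the switch suggested above would be a one-line change. A minimal sketch, assuming the `openai` Python package (v1 client); `embed_with_openai` and `DEFAULT_OPENAI_EMBEDDING_MODEL` are hypothetical names, not taken from the project:

```python
from typing import Any, List, Optional

# Assumption: the model name comes from configuration rather than being hard-coded.
DEFAULT_OPENAI_EMBEDDING_MODEL = "text-embedding-3-small"


def embed_with_openai(
    texts: List[str],
    model: str = DEFAULT_OPENAI_EMBEDDING_MODEL,
    client: Optional[Any] = None,
) -> List[List[float]]:
    """Return one embedding vector per input text."""
    if client is None:
        # Imported lazily so this module loads without the openai package.
        from openai import OpenAI

        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]
```

Vectors produced by different embedding models are not comparable with each other, so changing the model would also mean re-embedding any documents that are already indexed.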

Umpire2018 avatar Apr 12 '24 02:04 Umpire2018

Please note that maidalun1020/bce-embedding-base_v1 does not need a "query instruction", and you'd be better off using the BCEmbedding package instead.
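
To illustrate the point about query instructions, here is a small sketch of keeping the prefix per model so that bce queries pass through unchanged. The helper names and the name-matching rule are hypothetical and only for illustration; the instruction string is copied from the `FlagModel` call quoted earlier in this thread.

```python
from typing import Optional

# Instruction string taken from the FlagModel call quoted earlier in this thread.
BGE_ZH_QUERY_INSTRUCTION = "为这个句子生成表示以用于检索相关文章:"


def query_instruction_for(model_id: str) -> Optional[str]:
    """Return the query prefix a model expects, or None if it needs none.

    Hypothetical rule for illustration: only bge-*-zh models get the prefix.
    """
    name = model_id.split("/")[-1].lower()
    if name.startswith("bge-") and "-zh" in name:
        return BGE_ZH_QUERY_INSTRUCTION
    return None


def prepare_query(model_id: str, query: str) -> str:
    """Prepend the instruction when required; pass bce queries through unchanged."""
    instruction = query_instruction_for(model_id)
    return query if instruction is None else instruction + query
```

With this, `prepare_query("maidalun1020/bce-embedding-base_v1", q)` returns `q` untouched, while the bge Chinese model still gets its retrieval prefix.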

shenlei1020 avatar Apr 12 '24 02:04 shenlei1020