Shitao Xiao


Hello, this should be fine. After the data is synchronized across GPUs, the scores still form a matrix, and the positive of the i-th query sits at index i * group_size.
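A minimal sketch of that layout (not the exact training code; shapes and names are illustrative): with group_size passages per query, ordered as one positive followed by group_size - 1 negatives, the label of query i is column i * group_size of the score matrix.

```python
import torch

num_queries, group_size, dim = 4, 8, 16
q_reps = torch.randn(num_queries, dim)               # queries after the cross-GPU gather
p_reps = torch.randn(num_queries * group_size, dim)  # passages: [pos_0, negs_0, pos_1, negs_1, ...]

scores = q_reps @ p_reps.T                           # [num_queries, num_queries * group_size]
target = torch.arange(num_queries) * group_size      # positive of query i is at column i * group_size
loss = torch.nn.functional.cross_entropy(scores, target)
```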

In our experiments we used faiss + pyserini: https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse. There are also some community implementations, e.g. [vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb).
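For reference, a hedged sketch of the score-level fusion such a hybrid pipeline performs; the 0.3 sparse weight is only an illustrative value, not the setting used in the linked experiments:

```python
def hybrid_fuse(dense_hits, sparse_hits, sparse_weight=0.3, top_k=10):
    """dense_hits / sparse_hits: dict mapping doc_id -> retrieval score."""
    fused = dict(dense_hits)
    for doc_id, score in sparse_hits.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + sparse_weight * score
    # Rank by the weighted sum of dense and sparse scores.
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:top_k]
```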

Yes, you can upgrade the transformers version, e.g. to 4.37.
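For example, upgrading via pip:

```bash
pip install -U "transformers>=4.37"
```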

text-embeddings-inference and SentenceTransformer presumably add a sigmoid layer that maps the scores into the 0-1 range. Since sigmoid is a monotonic function, the relative order of the scores is unchanged, so the actual reranking results are identical.
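A quick check of that monotonicity argument, with made-up logits:

```python
import torch

logits = torch.tensor([2.3, -1.7, 0.4, 5.1])            # raw reranker scores
probs = torch.sigmoid(logits)                           # mapped into (0, 1)
assert torch.equal(logits.argsort(), probs.argsort())   # ranking is unchanged
```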

For QA datasets, we use the question as `query` and the answer/context as `pos`. We use the candidates (except the ground truth) provided by the original dataset as `neg`. If there are...
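For illustration, one line of the jsonl training file in that format (the texts are made up):

```json
{"query": "Who wrote Hamlet?", "pos": ["Hamlet is a tragedy written by William Shakespeare."], "neg": ["Macbeth was first performed in 1606.", "Christopher Marlowe wrote Doctor Faustus."]}
```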

A possible method is to use GPT to filter these questions. Using the cosine similarity between questions and answers is simpler, but the threshold is difficult to set.
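A rough sketch of the similarity-based option (the model name and the 0.5 threshold are placeholders; the threshold is exactly the part that is hard to set):

```python
import numpy as np
from FlagEmbedding import FlagModel

model = FlagModel("BAAI/bge-base-en-v1.5")
pairs = [("How do plants make food?", "Through photosynthesis ...")]  # your QA pairs

q_emb = model.encode([q for q, _ in pairs])
a_emb = model.encode([a for _, a in pairs])
# Normalize so the dot product equals cosine similarity.
q_emb = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
a_emb = a_emb / np.linalg.norm(a_emb, axis=1, keepdims=True)

sims = (q_emb * a_emb).sum(axis=1)
kept = [p for p, s in zip(pairs, sims) if s >= 0.5]  # drop low-similarity pairs
```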

Just follow the fine-tuning hyperparameters of bge v1.5: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#3-train. Use as large a batch_size as possible, learning rate = 1e-5 or 5e-6, and temperature = 0.02.
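Plugging those values into the command from the linked README might look like this (flag names follow that README; double-check them against your installed FlagEmbedding version):

```bash
torchrun --nproc_per_node 8 \
  -m FlagEmbedding.baai_general_embedding.finetune.run \
  --output_dir ./output \
  --model_name_or_path BAAI/bge-large-zh-v1.5 \
  --train_data ./your_data.jsonl \
  --learning_rate 1e-5 \
  --num_train_epochs 5 \
  --per_device_train_batch_size 64 \
  --train_group_size 8 \
  --negatives_cross_device \
  --temperature 0.02
```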

We are also thinking about how to make sparse embeddings easier to use. Previously we used pyserini (https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse), which is rather cumbersome. Engines such as Milvus support sparse embeddings, and we will integrate with them later. PRs to help us improve this are also welcome.

You can use the hybrid retrieval of bge-m3 following https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py
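A small sketch of producing the dense and sparse outputs of bge-m3 that the linked pymilvus example then indexes (API as documented in the FlagEmbedding repo; the document text is made up):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
docs = ["BGE-M3 supports dense, sparse, and multi-vector retrieval."]

out = model.encode(docs, return_dense=True, return_sparse=True)
dense_vecs = out["dense_vecs"]           # one dense vector per document
sparse_weights = out["lexical_weights"]  # per-document {token_id: weight} maps
```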