Shitao Xiao


Hello, this should be fine. After the data is synchronized across GPUs, the scores still form a matrix, and the positive of the i-th query sits at index i * group_size.
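A minimal sketch of that layout (not the exact training code; shapes and names are illustrative): with group_size passages per query, ordered as one positive followed by group_size - 1 negatives, the label of query i is column i * group_size of the score matrix.

```python
import torch

num_queries, group_size, dim = 4, 8, 16
q_reps = torch.randn(num_queries, dim)               # queries after the cross-GPU gather
p_reps = torch.randn(num_queries * group_size, dim)  # passages: [pos_0, negs_0, pos_1, negs_1, ...]

scores = q_reps @ p_reps.T                           # [num_queries, num_queries * group_size]
target = torch.arange(num_queries) * group_size      # positive of query i is at column i * group_size
loss = torch.nn.functional.cross_entropy(scores, target)
```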

In our experiments we used faiss + pyserini: https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse. There are also some community implementations, e.g. [vespa](https://github.com/vespa-engine/pyvespa/blob/master/docs/sphinx/source/examples/mother-of-all-embedding-models-cloud.ipynb).
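For reference, a hedged sketch of the score-level fusion such a hybrid pipeline performs; the 0.3 sparse weight is only an illustrative value, not the setting used in the linked experiments:

```python
def hybrid_fuse(dense_hits, sparse_hits, sparse_weight=0.3, top_k=10):
    """dense_hits / sparse_hits: dict mapping doc_id -> retrieval score."""
    fused = dict(dense_hits)
    for doc_id, score in sparse_hits.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + sparse_weight * score
    # Rank by the weighted sum of dense and sparse scores.
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:top_k]
```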

Yes, you can upgrade the transformers version, e.g. to 4.37.
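For example, upgrading via pip:

```bash
pip install -U "transformers>=4.37"
```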

text-embeddings-inference and SentenceTransformer presumably add a sigmoid layer that maps the scores into the 0-1 range. Since sigmoid is a monotonic function, the relative order of the scores is unchanged, so the actual reranking results are identical.
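A quick check of that monotonicity argument, with made-up logits:

```python
import torch

logits = torch.tensor([2.3, -1.7, 0.4, 5.1])            # raw reranker scores
probs = torch.sigmoid(logits)                           # mapped into (0, 1)
assert torch.equal(logits.argsort(), probs.argsort())   # ranking is unchanged
```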

For QA datasets, we use the question as `query` and the answer/context as `pos`. We use the candidates (except the ground truth) provided by the original dataset as `neg`. If there are...
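For illustration, one line of the jsonl training file in that format (the texts are made up):

```json
{"query": "Who wrote Hamlet?", "pos": ["Hamlet is a tragedy written by William Shakespeare."], "neg": ["Macbeth was first performed in 1606.", "Christopher Marlowe wrote Doctor Faustus."]}
```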

A possible method is to use GPT to filter these questions. Using the cosine similarity between questions and answers is simpler, but the threshold is difficult to set.
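A rough sketch of the similarity-based option (the model name and the 0.5 threshold are placeholders; the threshold is exactly the part that is hard to set):

```python
import numpy as np
from FlagEmbedding import FlagModel

model = FlagModel("BAAI/bge-base-en-v1.5")
pairs = [("How do plants make food?", "Through photosynthesis ...")]  # your QA pairs

q_emb = model.encode([q for q, _ in pairs])
a_emb = model.encode([a for _, a in pairs])
# Normalize so the dot product equals cosine similarity.
q_emb = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
a_emb = a_emb / np.linalg.norm(a_emb, axis=1, keepdims=True)

sims = (q_emb * a_emb).sum(axis=1)
kept = [p for p, s in zip(pairs, sims) if s >= 0.5]  # drop low-similarity pairs
```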

Just follow the fine-tuning hyperparameters of bge v1.5: https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune#3-train. Use as large a batch_size as possible, learning rate = 1e-5 or 5e-6, and temperature = 0.02.
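Plugging those values into the command from the linked README might look like this (flag names follow that README; double-check them against your installed FlagEmbedding version):

```bash
torchrun --nproc_per_node 8 \
  -m FlagEmbedding.baai_general_embedding.finetune.run \
  --output_dir ./output \
  --model_name_or_path BAAI/bge-large-zh-v1.5 \
  --train_data ./your_data.jsonl \
  --learning_rate 1e-5 \
  --num_train_epochs 5 \
  --per_device_train_batch_size 64 \
  --train_group_size 8 \
  --negatives_cross_device \
  --temperature 0.02
```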

We are also thinking about how to make sparse embeddings easier to use. Previously we used pyserini (https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB/MLDR#hybrid-retrieval-dense--sparse), which is rather cumbersome. Engines such as Milvus support sparse embeddings, and we will integrate with them later. PRs to help us improve this are also welcome.

You can use the hybrid retrieval of bge-m3 following https://github.com/milvus-io/pymilvus/blob/master/examples/hello_hybrid_sparse_dense.py
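A small sketch of producing the dense and sparse outputs of bge-m3 that the linked pymilvus example then indexes (API as documented in the FlagEmbedding repo; the document text is made up):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
docs = ["BGE-M3 supports dense, sparse, and multi-vector retrieval."]

out = model.encode(docs, return_dense=True, return_sparse=True)
dense_vecs = out["dense_vecs"]           # one dense vector per document
sparse_weights = out["lexical_weights"]  # per-document {token_id: weight} maps
```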