FlagEmbedding issues

About the loss computation when fine-tuning the bge embedding model

1

### baai_general_embedding.finetune

Hello, in an earlier issue, [https://github.com/FlagOpen/FlagEmbedding/issues/58](url), I read the following: 1. If `--normlized True` is passed, training uses cosine similarity to compute the score and uses temperature to scale the score. 2. Otherwise, the inner product is used to compute the score. But this is what I now see in forward in modeling.py: ![image](https://github.com/FlagOpen/FlagEmbedding/assets/125850867/25190653-4dcf-4a7c-9710-8f5924a7574a) And the compute_similarity function now uses the inner product: ![image](https://github.com/FlagOpen/FlagEmbedding/assets/125850867/75e0392e-834d-4127-a321-308ef72683c0) Have you stopped using cosine similarity to compute the score?
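For anyone reading this question later, here is a minimal sketch of the underlying math: the inner product of two L2-normalized vectors equals their cosine similarity, so a function that only computes an inner product can still yield cosine scores if the embeddings were normalized earlier. Whether the current modeling.py normalizes before calling compute_similarity is an assumption here, not something verified against the repository. The helper names and the `temperature=0.02` default are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


def score(query_vec, passage_vec, normalized=True, temperature=0.02):
    """Hypothetical scoring helper: inner product, optionally on unit vectors."""
    if normalized:
        query_vec = l2_normalize(query_vec)
        passage_vec = l2_normalize(passage_vec)
    return inner_product(query_vec, passage_vec) / temperature


q = [3.0, 4.0]
p = [4.0, 3.0]

cos = cosine_similarity(q, p)
ip_of_normalized = inner_product(l2_normalize(q), l2_normalize(p))
scaled = score(q, p, normalized=True, temperature=0.02)

print(cos, ip_of_normalized, scaled)
```

With normalization, the raw score stays in [-1, 1], and dividing by a small temperature stretches that range before the softmax in a contrastive loss.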

YJSoooooo

Question about CLS and MEAN_POOLING

3

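When an embedding model is used for vector retrieval, why does everyone default to CLS as the final output rather than MEAN_POOLING, FIRST_LAST_AVG, or other pooling methods? Is there data showing that CLS is optimal in most scenarios? How do the authors see this?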
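To make the two main options in this question concrete, here is a small sketch of CLS pooling and mask-aware mean pooling over one sequence of token vectors. It uses plain Python lists and made-up numbers; the function names are illustrative and are not taken from FlagEmbedding.

```python
def cls_pooling(hidden_states):
    """Use the vector of the first token ([CLS]) as the sentence embedding."""
    return list(hidden_states[0])


def mean_pooling(hidden_states, attention_mask):
    """Average the token vectors, skipping padding positions (mask == 0)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, mask in zip(hidden_states, attention_mask):
        if mask:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [t / count for t in total]


# Toy last-layer outputs: 4 positions, hidden size 2; the last one is padding.
hidden_states = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

cls_vec = cls_pooling(hidden_states)
mean_vec = mean_pooling(hidden_states, attention_mask)
print(cls_vec, mean_vec)
```

Whichever pooling is chosen at inference time should generally match the one the model was trained with, since the training objective shapes only the pooled vector it actually used.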

blue-vision0

Training data for fine-tuning

3

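Has the triplet data used to fine-tune the bge-large-zh model been open-sourced? Building training data myself is rather slow, so I would like to fine-tune on an open dataset first and see how it performs.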

Vincent2Liu

Need help with English datasets

2

Hi, amazing work, highly inspirational. Thanks a lot for making it open source. Which datasets did you use for pre-training the English-only model? It is mentioned that...

bharadwajyadati

About temperature, query_instruction_for_retrieval and passage_instruction_for_retrieval?

2

Hi guys, thanks for your great repo. I want to ask some questions. 1. What is the similarity distribution of the model when I set temperature = 0.02? Previously, I saw...

chuan298

Confirming how the 512-token limit is computed for the reranker

4

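When fine-tuning the reranker, is the 512-token length computed by directly concatenating (add) the query and the pos/neg text? At the code level, are any transitional phrases or sentences inserted to link the two? My logic for filtering the training data is: `from transformers import AutoTokenizer` `tokenizer = AutoTokenizer.from_pretrained('<reranker path>')` `def get_token_count(query): return len(tokenizer(query)['input_ids'])` `query = 'xxx'` `pos = 'xxx'` `if get_token_count(query + pos) > 512: continue` My logic is rigid: it uses exactly 512 and leaves no buffer. I don't know whether the code adds any default linking phrases or sentences between the two. If it does, I can't use 512 as the filtering threshold.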
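One thing worth checking in a filter like this is the difference between tokenizing the concatenated string and tokenizing the two texts as a pair. Hugging Face tokenizers accept a second text argument, and pair encoding usually adds more special tokens than single-text encoding, so `query + pos` can undercount. The sketch below assumes the fine-tuning code encodes query and passage as a pair, which is not verified here. `ToyPairTokenizer` is a hypothetical stand-in that splits on whitespace and imitates an `<s> A </s></s> B </s>` layout so the example runs without downloading a model; with a real tokenizer you would pass the `AutoTokenizer` instance instead.

```python
class ToyPairTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer (whitespace tokens)."""

    BOS, EOS = 0, 2

    def _encode(self, text):
        # One fake id per whitespace-separated token.
        return [10 + i for i, _ in enumerate(text.split())]

    def __call__(self, text, text_pair=None):
        ids = [self.BOS] + self._encode(text) + [self.EOS]
        if text_pair is not None:
            # Assumed pair layout: two extra separators around the second text.
            ids += [self.EOS] + self._encode(text_pair) + [self.EOS]
        return {"input_ids": ids}


def pair_token_count(tokenizer, query, passage):
    """Count tokens the way a pair-encoding model would see them."""
    return len(tokenizer(query, passage)["input_ids"])


def keep_pair(tokenizer, query, passage, max_length=512):
    """Return True if the encoded pair fits within max_length tokens."""
    return pair_token_count(tokenizer, query, passage) <= max_length


tokenizer = ToyPairTokenizer()
query = "what is bge"
passage = "bge is an embedding model"

concat_count = len(tokenizer(query + passage)["input_ids"])
pair_count = pair_token_count(tokenizer, query, passage)
print(concat_count, pair_count, keep_pair(tokenizer, query, passage))
```

With a real tokenizer, `tokenizer.num_special_tokens_to_add(pair=True)` reports how many special tokens pair encoding adds, which gives the buffer to subtract from 512.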

mechigonft

Why does eval_msmarco for BGE embedding use two datasets, train and dev, for testing?

2

eval_msmarco contains: `eval_data = datasets.load_dataset("namespace-Pt/msmarco", split="dev")` `corpus = datasets.load_dataset("namespace-Pt/msmarco-corpus", split="train")` The corpus is then used to build the faiss_index, and eval_data supplies the queries. Why is evaluation done this way? My understanding is that evaluation should use QA-pair data: take Q as the query and A as the corpus, run retrieval, and compute the metrics. A single dataset should be enough, so why not use only namespace-Pt/msmarco? It has both query and positive. Or do the metrics below simply have to be computed from two datasets? `{ 'MRR@1': 0.2330945558739255, 'MRR@10': 0.35786976395142633, 'MRR@100': 0.3692618036917553, 'Recall@1': 0.22606255969436478, 'Recall@10': 0.6412965616045848, 'Recall@100': 0.9012774594078318 }`
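A note that may help with the question above: the two `load_dataset` calls name two different datasets, one holding the dev queries with their positives and one holding the passage collection to search, so `split="train"` on the corpus presumably just selects that collection rather than training data. Retrieval metrics are normally computed by ranking the whole collection for each query and then checking where the positives land. The sketch below uses common definitions of MRR@k and Recall@k on made-up ids; the exact definitions in eval_msmarco are not shown in the question and may differ.

```python
def mrr_at_k(rankings, positives, k):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in positives[query_id]:
                total += 1.0 / rank
                break
    return total / len(rankings)


def recall_at_k(rankings, positives, k):
    """Mean fraction of each query's relevant documents found in the top k."""
    total = 0.0
    for query_id, ranked_ids in rankings.items():
        relevant = positives[query_id]
        hits = len(relevant.intersection(ranked_ids[:k]))
        total += hits / len(relevant)
    return total / len(rankings)


# Hypothetical search results: for each query, corpus ids sorted by score.
rankings = {
    "q1": ["d3", "d1", "d2"],
    "q2": ["d2", "d3", "d1"],
}
# Hypothetical ground truth: relevant corpus ids per query.
positives = {
    "q1": {"d1"},
    "q2": {"d1", "d4"},
}

mrr_1 = mrr_at_k(rankings, positives, k=1)
mrr_3 = mrr_at_k(rankings, positives, k=3)
recall_3 = recall_at_k(rankings, positives, k=3)
print(mrr_1, mrr_3, recall_3)
```

If the index contained only the positives of the evaluated queries, every query would compete against far fewer distractors, so the scores would not be comparable with numbers reported on the full collection.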

128Ghe980

bge-m3 default representation

3

Hello :) First, thank you for your amazing work! When using bge-m3 within langchain, what is the default representation of the encoding? Dense, or a mix of different ones (sparse...)?

nico2rdj