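Reranking two passages twice with bge-reranker-large, once with strip() applied to every passage and once without, gives very different results. Why is the difference so large?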

Open yumingmin88 opened this issue 11 months ago • 1 comments

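This question originally came up when loading bge-reranker-large through different libraries (sentence_transformers, etc.) for reranking: the results (that is, the order after reranking) all differed. One reason is that sentence_transformers strips whitespace from its inputs. The script below simulates that whitespace stripping: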

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "/data/reranker/bge-reranker-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

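def rerank(query, documents):
    """Score each (query, document) pair; a higher logit means more relevant."""
    pairs = [(query, doc) for doc in documents]

    # Tokenize all pairs as one padded batch, truncated to 512 tokens.
    inputs = tokenizer(
        pairs,
        padding=True,
        truncation=True,
        return_tensors="pt",
        max_length=512
    ).to(device)

    # Inference only, so gradients are not needed.
    with torch.no_grad():
        outputs = model(**inputs)
        scores = outputs.logits.view(-1).float().tolist()
    return scores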
# Two near-identical candidates; the second one only adds a trailing "。".
query = 'J2智能扫地机器人能吸得干净猫毛吗？'
candidates = [
    ' Q:边刷炸毛严重\nA:机器人在地毯等材质上面清洁，可能导致出现容易炸毛的情况。边刷炸毛对扫地功能、清洁效果影响不大，可定期更换边刷使用\n',
    ' Q:边刷炸毛严重\nA:机器人在地毯等材质上面清洁，可能导致出现容易炸毛的情况。边刷炸毛对扫地功能、清洁效果影响不大，可定期更换边刷使用。\n',
]

# Run 1: strip leading/trailing whitespace to mimic sentence_transformers.
candidates_ = [i.strip() for i in candidates]
scores = rerank(query, candidates_)
res_scores = sorted(enumerate(scores), key=lambda x: -x[1])
print("strip(),Ranked Candidates:")
for idx, score in res_scores:
    print(f"index:{idx} score:{score}")

# Run 2: pass the candidates through unchanged.
scores = rerank(query, candidates)
res_scores = sorted(enumerate(scores), key=lambda x: -x[1])
print()
print("no strip() Ranked Candidates:")
for idx, score in res_scores:
    print(f"index:{idx} score:{score}")

The results are:

strip(),Ranked Candidates:
index:0 score:1.6036423444747925
index:1 score:0.7816800475120544

no strip() Ranked Candidates:
index:0 score:0.8946816325187683
index:1 score:0.8705350756645203
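
To make the comparison concrete, here is a small sketch that takes the four scores reported above and compares the resulting order and the score gap in each run. The helper names `rank_order` and `score_gap` are illustrative only and are not part of any library. For these two candidates the order comes out the same in both runs; what changes is how far apart the scores are.

```python
def rank_order(scores):
    """Return document indices sorted by descending score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def score_gap(scores):
    """Return the difference between the highest and lowest score."""
    return max(scores) - min(scores)


# Scores copied from the two runs reported above.
stripped = [1.6036423444747925, 0.7816800475120544]
unstripped = [0.8946816325187683, 0.8705350756645203]

print("strip():   ", rank_order(stripped), score_gap(stripped))
print("no strip():", rank_order(unstripped), score_gap(unstripped))
```

With only two candidates this says nothing about how the pair would move within a larger list; that would need the scores of the other passages as well.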

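As you can see, the only change to the two texts in candidates between the two runs is that leading and trailing whitespace was removed. Why is the difference so large? If these two passages sat among 100 passages, their positions after reranking would shift considerably (they were in fact taken from a set of 100). Is it because "\n" carried very high importance during pretraining? Must it never be removed when reranking? Thanks, I look forward to your guidance.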
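
One way to investigate is to check which tokens actually differ between the stripped and unstripped text. The following is a minimal sketch, not a diagnosis: `token_diff` is a hypothetical helper, and the demo uses a stand-in tokenizer (one token per character) so that it runs without the model. With the real model, `tokenizer.tokenize` from the script above could be passed in instead; what it would show for this input is not claimed here.

```python
import difflib


def token_diff(tokenize, text_a, text_b):
    """List the token-level edits that turn tokenize(text_a) into tokenize(text_b)."""
    a, b = tokenize(text_a), tokenize(text_b)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Stand-in tokenizer for demonstration only: one token per character.
raw = " Q:abc\nA:def\n"
for edit in token_diff(list, raw, raw.strip()):
    print(edit)
```

If the real tokenizer reports no differing tokens, the whitespace is being dropped before it reaches the model and the score change must come from somewhere else; if it does report differences, those tokens are the ones to look at.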

Jan 16 '25 08:01 yumingmin88

Normally, as long as a consistent format is used, the effect on relative ranking is small. If the results are poor, consider using bge-reranker-v2-m3.
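
A minimal sketch of what "a consistent format" could look like in practice: apply one normalization function to every input before building the pairs. The function names `normalize` and `prepare_pairs` are hypothetical, and using plain strip() as the rule is only an assumption; the point is that the same rule is applied to every document.

```python
def normalize(text):
    """Apply one fixed preprocessing rule (here: strip surrounding whitespace)."""
    return text.strip()


def prepare_pairs(query, documents):
    """Build (query, document) pairs with the same normalization for every input."""
    return [(normalize(query), normalize(doc)) for doc in documents]


pairs = prepare_pairs(" query ", [" doc one\n", "doc two"])
print(pairs)
```

The resulting pairs can then be passed to the tokenizer in place of the raw `pairs` list built inside `rerank`.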

Jan 23 '25 09:01 545999961