The same reranker model with the same input produces very different scores under xinference versus native Transformers, and xinference returns the wrong ranking
System Info
CUDA Version: 12.4 Ubuntu 22.04.3 LTS
Running Xinference with Docker?
- [x] docker
- [ ] pip install
- [ ] installation from source
Version info
xinference: v1.3.1
transformers: 4.40.1
The command used to start Xinference
docker run -d --name xinference --restart=always \
-e HF_ENDPOINT=https://hf-mirror.com \
-e HUGGING_FACE_HUB_TOKEN=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 \
-e LOG_TZ=Asia/Shanghai \
-e TZ=Asia/Shanghai \
-v /root/.xinference:/root/.xinference \
-v /root/.cache/huggingface:/root/.cache/huggingface \
-v /root/.cache/modelscope:/root/.cache/modelscope \
-v /data2/models:/data2/models \
-p 9997:9997 \
-p 8777:8777 \
--gpus all \
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v1.3.1 \
xinference-local -H 0.0.0.0 --port 9997 -mp 8777
Reproduction
Query: 中国的首都是哪里? ("Where is the capital of China?"). xinference gives shanghai the higher score, which is not expected; Transformers ranks beijing higher, which is expected.
- Launch MiniCPM-Reranker-Light within xinference
- curl to the reranker model:
curl -X 'POST' 'http://xxx:9997/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "MiniCPM-Reranker-Light",
"query": "中国的首都是哪里?",
"documents": [
"beijing",
"shanghai"
]
}'
Output:
{"id":"4e605156-0634-11f0-a430-0242c0a80102","results":[{"index":1,"relevance_score":0.021355781704187393,"document":null},{"index":0,"relevance_score":0.011472251266241074,"document":null}],"meta":{"api_version":null,"billed_units":null,"tokens":null,"warnings":null}}
- Run Python code with Transformers:
from transformers import AutoModelForSequenceClassification
import torch
model_name = "openbmb/MiniCPM-Reranker-Light"
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")
# You can also use the following code to use flash_attention_2
# model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()
query = "中国的首都是哪里?" # "Where is the capital of China?"
passages = ["beijing", "shanghai"]  # Beijing, Shanghai
rerank_score = model.rerank(query, passages, query_instruction="Query:", batch_size=32, max_length=1024)
print(rerank_score) #[0.01791382 0.00024533]
sentence_pairs = [[f"Query: {query}", doc] for doc in passages]
scores = model.compute_score(sentence_pairs, batch_size=32, max_length=1024)
print(scores) #[0.01791382 0.00024533]
Output:
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Computing scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4.09it/s]
[0.01785278 0.00024915]
Computing scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17.27it/s]
[0.01785278 0.00024915]
Expected behavior
beijing is expected to receive the higher score.
The engine behind Xinference here is sentence-transformers. This needs to be reproduced.
These two scores are consistent with each other.
I see the tokenizer sets padding_side and so on; I'm not sure whether that is fully consistent between the two.
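For illustration, a minimal check (a sketch, assuming both code paths load the tokenizer from the same openbmb/MiniCPM-Reranker-Light repository) to compare the tokenizer configuration:

from transformers import AutoTokenizer

# Sketch: inspect the padding_side used when the tokenizer is loaded this way,
# to compare against whatever the serving engine configures.
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-Reranker-Light", trust_remote_code=True
)
print(tokenizer.padding_side)  # e.g. "left" or "right"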
With the reranker model served by xinference right now, the results are worse than not reranking at all.
Could you help look into this, @qinxuye?
Will look into it this week.
Still following this.
This issue was closed because it has been inactive for 5 days since being marked as stale.
@llyycchhee please track this issue.
This is because the MiniCPM-Reranker-Light model requires the instruction INSTRUCTION="Query: " to be prepended to every query. In the xinference curl request, users can adjust the query parameter as the model requires, e.g. "query": "Query: 中国的首都是哪里?". We will add a note about this to the documentation.
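For reference, a minimal client-side sketch of this workaround; the endpoint, model name, and payload fields come from the curl example above, while the host/port and helper name are placeholders:

import requests

# Hypothetical endpoint; replace with your Xinference host and port.
XINFERENCE_URL = "http://localhost:9997/v1/rerank"
INSTRUCTION = "Query: "  # instruction expected by MiniCPM-Reranker-Light

def rerank(query, documents, model="MiniCPM-Reranker-Light"):
    # Prepend the instruction to the raw query before calling /v1/rerank.
    payload = {
        "model": model,
        "query": INSTRUCTION + query,
        "documents": documents,
    }
    resp = requests.post(XINFERENCE_URL, json=payload, timeout=30)
    resp.raise_for_status()
    # Each result carries the original document index and a relevance_score.
    return resp.json()["results"]

for item in rerank("中国的首都是哪里?", ["beijing", "shanghai"]):
    print(item["index"], item["relevance_score"])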
Verified, this works. Could xinference automatically prepend this "Query: " for users? Many platforms, including dify, make it difficult to handle this INSTRUCTION. @llyycchhee