The same reranker model with the same input produces very different scores under xinference versus native Transformers, and xinference returns the wrong ranking
System Info
CUDA Version: 12.4 Ubuntu 22.04.3 LTS
Running Xinference with Docker?
- [x] docker
- [ ] pip install
- [ ] installation from source
Version info
xinference: v1.3.1
transformers: 4.40.1
The command used to start Xinference
docker run -d --name xinference --restart=always \
-e HF_ENDPOINT=https://hf-mirror.com \
-e HUGGING_FACE_HUB_TOKEN=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 \
-e LOG_TZ=Asia/Shanghai \
-e TZ=Asia/Shanghai \
-v /root/.xinference:/root/.xinference \
-v /root/.cache/huggingface:/root/.cache/huggingface \
-v /root/.cache/modelscope:/root/.cache/modelscope \
-v /data2/models:/data2/models \
-p 9997:9997 \
-p 8777:8777 \
--gpus all \
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v1.3.1 \
xinference-local -H 0.0.0.0 --port 9997 -mp 8777
Reproduction
Query: 中国的首都是哪里? ("Where is the capital of China?"). xinference gives shanghai the higher score, which is not expected; Transformers ranks beijing higher, which is expected.
- Launch MiniCPM-Reranker-Light within xinference
- curl to the reranker model:
curl -X 'POST' 'http://xxx:9997/v1/rerank' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "MiniCPM-Reranker-Light",
"query": "中国的首都是哪里?",
"documents": [
"beijing",
"shanghai"
]
}'
Output:
{"id":"4e605156-0634-11f0-a430-0242c0a80102","results":[{"index":1,"relevance_score":0.021355781704187393,"document":null},{"index":0,"relevance_score":0.011472251266241074,"document":null}],"meta":{"api_version":null,"billed_units":null,"tokens":null,"warnings":null}}
- Run Python code with Transformers:
from transformers import AutoModelForSequenceClassification
import torch
model_name = "openbmb/MiniCPM-Reranker-Light"
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")
# You can also use the following code to use flash_attention_2
# model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()
query = "中国的首都是哪里?" # "Where is the capital of China?"
passages = ["beijing", "shanghai"]  # Beijing, Shanghai
rerank_score = model.rerank(query, passages, query_instruction="Query:", batch_size=32, max_length=1024)
print(rerank_score) #[0.01791382 0.00024533]
sentence_pairs = [[f"Query: {query}", doc] for doc in passages]
scores = model.compute_score(sentence_pairs, batch_size=32, max_length=1024)
print(scores) #[0.01791382 0.00024533]
Output:
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Computing scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4.09it/s]
[0.01785278 0.00024915]
Computing scores: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17.27it/s]
[0.01785278 0.00024915]
Expected behavior
beijing is expected to receive the higher score.
The engine behind Xinference here is sentence-transformers. This needs to be reproduced.
These two scores are consistent with each other.
I see the tokenizer sets padding_side and so on; I'm not sure whether that is fully consistent between the two.
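For illustration, a minimal check (a sketch, assuming both code paths load the tokenizer from the same openbmb/MiniCPM-Reranker-Light repository) to compare the tokenizer configuration:

from transformers import AutoTokenizer

# Sketch: inspect the padding_side used when the tokenizer is loaded this way,
# to compare against whatever the serving engine configures.
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-Reranker-Light", trust_remote_code=True
)
print(tokenizer.padding_side)  # e.g. "left" or "right"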
With the reranker model served by xinference right now, the results are worse than not reranking at all.
Could you help look into this, @qinxuye?
Will look into it this week.
Still following this.
This issue was closed because it has been inactive for 5 days since being marked as stale.
@llyycchhee please track this issue.
This is because the MiniCPM-Reranker-Light model requires the instruction INSTRUCTION="Query: " to be prepended to every query. In the xinference curl request, users can adjust the query parameter as the model requires, e.g. "query": "Query: 中国的首都是哪里?". We will add a note about this to the documentation.
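For reference, a minimal client-side sketch of this workaround; the endpoint, model name, and payload fields come from the curl example above, while the host/port and helper name are placeholders:

import requests

# Hypothetical endpoint; replace with your Xinference host and port.
XINFERENCE_URL = "http://localhost:9997/v1/rerank"
INSTRUCTION = "Query: "  # instruction expected by MiniCPM-Reranker-Light

def rerank(query, documents, model="MiniCPM-Reranker-Light"):
    # Prepend the instruction to the raw query before calling /v1/rerank.
    payload = {
        "model": model,
        "query": INSTRUCTION + query,
        "documents": documents,
    }
    resp = requests.post(XINFERENCE_URL, json=payload, timeout=30)
    resp.raise_for_status()
    # Each result carries the original document index and a relevance_score.
    return resp.json()["results"]

for item in rerank("中国的首都是哪里?", ["beijing", "shanghai"]):
    print(item["index"], item["relevance_score"])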
Verified, this works. Could xinference automatically prepend this "Query: " for users? Many platforms, including dify, make it difficult to handle this INSTRUCTION. @llyycchhee