Hybrid search deduplication mixes text and vector results with incompatible score scales
Self Checks
- [x] I have read the Contributing Guide and Language Policy.
- [x] This is only for bug reports; if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report; otherwise it will be closed.
- [x] [Chinese & non-English users] Please submit in English, otherwise the report will be closed :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
1.9.2
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
- Configure Milvus as the vector database.
- Run a hybrid search query that returns results from both:
  - Full-text search (inner product metric, unbounded positive scores)
  - Vector search (cosine similarity, score in [-1, 1])
- Add a debug log `print("###DEBUG###", document.metadata['doc_id'], score)` at https://github.com/langgenius/dify/blob/fd255e81e13299ac2caf86eab432d899282c73dc/api/core/rag/rerank/weight_rerank.py#L66 and inspect the scores it prints.
✔️ Expected Behavior
Hybrid search deduplication and reranking should normalize or handle text and vector scores separately so that their score ranges are comparable (e.g., both scaled to [0, 1] or otherwise weighted before merging). The final ranking should correctly reflect the intended relevance balance between text and semantic similarity.
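A minimal sketch of what handling the scores separately could look like, assuming min-max scaling of each modality to [0, 1] before the weighted merge (hypothetical helpers, not the actual weight_rerank.py code):

```python
def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Scale one modality's scores into [0, 1]. Hypothetical helper, not Dify code."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All hits scored identically: treat them as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def merge_hybrid(text_hits: dict[str, float],
                 vector_hits: dict[str, float],
                 text_weight: float = 0.3,
                 vector_weight: float = 0.7) -> list[tuple[str, float]]:
    """Normalize each modality independently, then apply the weighted sum."""
    merged: dict[str, float] = {}
    for hits, weight in ((min_max_normalize(text_hits), text_weight),
                         (min_max_normalize(vector_hits), vector_weight)):
        for doc_id, score in hits.items():
            merged[doc_id] = merged.get(doc_id, 0.0) + weight * score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
```

With the illustrative scores from the reproduction step, doc_b (present in both result sets) then outranks the text-only doc_a, which is the balance the 0.3/0.7 weights are meant to express.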
❌ Actual Behavior
When Milvus is used, text search results (using inner product) have extremely large positive scores, while vector results (cosine similarity) are in the range [-1, 1]. During deduplication and reranking, these scores are mixed without normalization, causing:
- Text search results to dominate due to larger magnitude
- Unstable or incorrect ranking order in the hybrid results
Hi @myml! I'm Dosu and I’m helping the dify team.
You're right: Dify v1.9.2 does not normalize or scale text (inner product) and vector (cosine similarity) scores before deduplication and reranking in hybrid search with Milvus. The reranking logic in weight_rerank.py simply applies weights and sums the raw scores, so large-magnitude text scores dominate the ranking, causing unstable or incorrect results. There is no built-in normalization step for these scores as of this version, and recent code and support threads do not show an official fix or workaround yet (source).
To address this, you could patch the rerank logic to normalize each score type before combining—using min-max scaling, z-score normalization, or mapping both to a common range (like [0, 1]). Another option is to tune the weights so neither modality overwhelms the other, but normalization is the more robust fix.
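For illustration, a sketch of the z-score variant mentioned above (again a hypothetical helper, not a patch against weight_rerank.py); in either scheme, each modality is normalized on its own before the weights are applied, so the configured weights actually express the text/semantic balance:

```python
import statistics

def z_score_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Standardize one modality's scores to mean 0, stdev 1. Hypothetical helper."""
    if len(scores) < 2:
        # Not enough points to estimate spread; fall back to a neutral score.
        return {doc_id: 0.0 for doc_id in scores}
    mean = statistics.fmean(scores.values())
    stdev = statistics.pstdev(scores.values())
    if stdev == 0.0:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - mean) / stdev for doc_id, s in scores.items()}
```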
If you need a code-level suggestion for where to add normalization, let me know!
To reply, just mention @dosu.