Inconsistent Scoring in WeightRerankRunner Due to Skipped _calculate_cosine for Pre-scored Documents
Self Checks
- [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
v.1.3.0
Cloud or Self Hosted
Self Hosted (Source)
Steps to reproduce
In WeightRerankRunner (line 41), the documents are merged before reranking. Because the merged list can contain results from both vector retrieval and full-text retrieval, the score already attached to a given document may come from either source. The _calculate_cosine method skips recalculation for any document that already has a score, so it can return a full-text retrieval score where a cosine score is expected, leading to unexpected results.
✔️ Expected Behavior
Consistent Scoring
Ensure _calculate_cosine always recomputes cosine scores for documents, ignoring pre-existing scores (from full-text).
❌ Actual Behavior
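_calculate_cosine sometimes returns the vector retrieval score and sometimes the full-text retrieval score for a document, depending on which retrieval path produced it.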
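The difference between the actual and expected behavior can be illustrated with a minimal, self-contained sketch. It does not use Dify's code: the `Doc` class, the `metadata["score"]` key, the helper names, and the numbers (including the full-text score of 7.5) are all assumptions made up for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (not Dify's Document class)."""
    page_content: str
    vector: list
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Plain cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scores_with_skip(query_vec, docs):
    """Reuse an existing score when present (the behavior described in this report)."""
    return [
        d.metadata["score"] if "score" in d.metadata else cosine(query_vec, d.vector)
        for d in docs
    ]


def scores_always_recomputed(query_vec, docs):
    """Ignore any existing score and always compute the cosine (the expected behavior)."""
    return [cosine(query_vec, d.vector) for d in docs]


query_vec = [1.0, 0.0]
docs = [
    # Score came from vector retrieval, so it is already a cosine value.
    Doc("from vector search", [1.0, 0.0], {"score": 1.0}),
    # Score came from full-text retrieval (made-up value on a different scale).
    Doc("from full-text search", [0.0, 1.0], {"score": 7.5}),
]

skipped = scores_with_skip(query_vec, docs)
recomputed = scores_always_recomputed(query_vec, docs)
print(skipped)     # [1.0, 7.5]  -> mixes a cosine score with a full-text score
print(recomputed)  # [1.0, 0.0]  -> every entry is a cosine score
```

In the skipped variant the second document outranks the first purely because its score lives on a different scale, which is the inconsistency reported here.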
To ensure consistent scoring in WeightRerankRunner by making _calculate_cosine always recompute cosine scores for documents, you need to modify the _calculate_cosine method to ignore any pre-existing scores and always calculate the cosine similarity from scratch. Here's an improved version of the _calculate_cosine method:
import math
from collections import Counter

# Document, VectorSetting and JiebaKeywordTableHandler come from the Dify codebase.


def _calculate_cosine(
    self, tenant_id: str, query: str, documents: list[Document], vector_setting: VectorSetting
) -> list[float]:
    """
    Calculate Cosine scores
    :param query: search query
    :param documents: documents for reranking
    :return: one cosine similarity per document, in input order
    """
    # NOTE: tenant_id and vector_setting are not used in this snippet.
    # NOTE: self.keyword_idf is assumed to be a dict mapping keyword -> IDF weight;
    # it is not defined in this snippet.

    # Extract keywords from the query
    keyword_table_handler = JiebaKeywordTableHandler()
    query_keywords = keyword_table_handler.extract_keywords(query, None)

    # Count query keywords (TF)
    query_keyword_counts = Counter(query_keywords)

    # Calculate TF-IDF for the query
    query_tfidf = {}
    for keyword, count in query_keyword_counts.items():
        tf = count
        idf = self.keyword_idf.get(keyword, 0)
        query_tfidf[keyword] = tf * idf

    # Extract keywords and calculate TF-IDF for each document.
    # Any score already stored on the document is ignored.
    documents_tfidf = []
    for document in documents:
        document_keywords = keyword_table_handler.extract_keywords(document.page_content, None)
        document_keyword_counts = Counter(document_keywords)
        document_tfidf = {}
        for keyword, count in document_keyword_counts.items():
            tf = count
            idf = self.keyword_idf.get(keyword, 0)
            document_tfidf[keyword] = tf * idf
        documents_tfidf.append(document_tfidf)

    # Calculate cosine similarities between sparse TF-IDF vectors
    def cosine_similarity(vec1, vec2):
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum(vec1[x] * vec2[x] for x in intersection)
        sum1 = sum(vec1[x] ** 2 for x in vec1)
        sum2 = sum(vec2[x] ** 2 for x in vec2)
        denominator = math.sqrt(sum1) * math.sqrt(sum2)
        if not denominator:
            # At least one vector is empty or all zeros
            return 0.0
        return float(numerator) / denominator

    similarities = []
    for document_tfidf in documents_tfidf:
        similarity = cosine_similarity(query_tfidf, document_tfidf)
        similarities.append(similarity)

    return similarities
This code ensures that the cosine similarity is recalculated for each document based on the current query and document TF-IDF vectors, ignoring any pre-existing scores [1][2].
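As a quick sanity check, the sparse cosine helper used in the method can be exercised on its own. This is a standalone sketch under stated assumptions: the keyword weights are made-up values, not real TF-IDF output, and no Dify classes are involved.

```python
import math


def cosine_similarity(vec1: dict, vec2: dict) -> float:
    """Cosine similarity between two sparse vectors stored as keyword -> weight dicts."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[k] * vec2[k] for k in shared)
    denominator = math.sqrt(sum(v * v for v in vec1.values())) * math.sqrt(
        sum(v * v for v in vec2.values())
    )
    # An empty or all-zero vector has no direction, so report 0.0.
    return numerator / denominator if denominator else 0.0


# Made-up weights for illustration only.
query_tfidf = {"rerank": 2.0, "score": 1.0}
doc_same = {"rerank": 2.0, "score": 1.0}   # identical keywords and weights
doc_disjoint = {"vector": 3.0}             # no keyword in common with the query
doc_empty = {}                             # no keywords extracted

sim_same = cosine_similarity(query_tfidf, doc_same)
sim_disjoint = cosine_similarity(query_tfidf, doc_disjoint)
sim_empty = cosine_similarity(query_tfidf, doc_empty)

print(sim_same)      # approximately 1.0 (subject to floating-point rounding)
print(sim_disjoint)  # 0.0
print(sim_empty)     # 0.0
```

Because every result is computed from the two vectors alone, the output always falls in the same range regardless of which retrieval path originally returned the document.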