Inconsistent Scoring in WeightRerankRunner Due to Skipped _calculate_cosine for Pre-scored Documents

Open rainsoft opened this issue 7 months ago • 1 comment

Self Checks

[x] This is only for bug report, if you would like to ask a question, please head to Discussions.
[x] I have searched for existing issues, including closed ones.
[x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[x] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[x] Please do not modify this template :) and fill in all the required fields.

Dify version

v.1.3.0

Cloud or Self Hosted

Self Hosted (Source)

Steps to reproduce

In WeightRerankRunner (line 41), the documents being merged may come from both vector retrieval and full-text retrieval, so the score already attached to a given document may have been produced by either one. The _calculate_cosine method skips recalculation when a document already has a score, so it can return the full-text retrieval score in place of a cosine similarity, which leads to unexpected results.
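The behavior described above can be illustrated with a minimal sketch. It is not Dify's code: the `Doc` class, the `cosine` helper, the `cosine_scores_with_skip` function, and the idea that the score is stored under `metadata["score"]` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document (not Dify's Document class)."""
    vector: list[float]
    metadata: dict = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; returns 0.0 when either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if not norm:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def cosine_scores_with_skip(query_vector: list[float], docs: list[Doc]) -> list[float]:
    """Mimics the reported behavior: reuse an existing score instead of recomputing."""
    scores = []
    for doc in docs:
        if "score" in doc.metadata:
            # Assumed shortcut: this score may have come from full-text retrieval.
            scores.append(doc.metadata["score"])
        else:
            scores.append(cosine(query_vector, doc.vector))
    return scores


query_vector = [1.0, 0.0]
# Two copies of the same content, reached through different retrieval paths.
from_vector_search = Doc(vector=[1.0, 0.0])                         # no score attached
from_full_text = Doc(vector=[1.0, 0.0], metadata={"score": 7.3})    # made-up full-text score

skip_scores = cosine_scores_with_skip(query_vector, [from_vector_search, from_full_text])
```

Both documents have the same vector, yet the second one keeps its made-up full-text score, so the two results are not on the same scale.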

✔️ Expected Behavior

Consistent Scoring

Ensure _calculate_cosine always recomputes cosine scores for documents, ignoring pre-existing scores (from full-text).
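A minimal sketch of the expected behavior, under the same caveats: the dictionary layout (`"vector"`, `"metadata"`, `"score"`) and the function names are hypothetical and only illustrate the idea of never reusing a stored score.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity; returns 0.0 when either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def recompute_cosine_scores(query_vector: list[float], docs: list[dict]) -> list[float]:
    """Always compute cosine similarity; a stored score is deliberately never read."""
    return [cosine(query_vector, doc["vector"]) for doc in docs]


docs = [
    {"vector": [3.0, 4.0], "metadata": {}},               # e.g. from vector retrieval
    {"vector": [3.0, 4.0], "metadata": {"score": 7.3}},   # e.g. from full-text retrieval
]
recomputed = recompute_cosine_scores([1.0, 0.0], docs)
```

With this approach, identical vectors receive identical scores regardless of which retrieval path attached a score earlier.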

❌ Actual Behavior

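_calculate_cosine sometimes returns the vector retrieval score and sometimes the full-text retrieval score, depending on where the document's existing score came from.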

Apr 27 '25 13:04 rainsoft

To make scoring in WeightRerankRunner consistent, modify the _calculate_cosine method so that it ignores any pre-existing scores and always calculates the cosine similarity from scratch. Here is a revised version of the method:

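# Module-level imports this method relies on
import math
from collections import Counter

def _calculate_cosine(
    self, tenant_id: str, query: str, documents: list[Document], vector_setting: VectorSetting
) -> list[float]:
    """
    Calculate cosine scores between the query and each document.

    :param tenant_id: tenant ID (not used in this version)
    :param query: search query
    :param documents: documents for reranking
    :param vector_setting: vector setting (not used in this version)
    :return: one cosine similarity per document, in input order
    """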
    # Extract keywords from the query
    keyword_table_handler = JiebaKeywordTableHandler()
    query_keywords = keyword_table_handler.extract_keywords(query, None)

    # Count query keyword occurrences (term frequency)
    query_keyword_counts = Counter(query_keywords)

    # Calculate TF-IDF for the query
    query_tfidf = {}
    for keyword, count in query_keyword_counts.items():
        tf = count
        idf = self.keyword_idf.get(keyword, 0)
        query_tfidf[keyword] = tf * idf

    # Extract keywords and calculate TF-IDF for each document
    documents_tfidf = []
    for document in documents:
        document_keywords = keyword_table_handler.extract_keywords(document.page_content, None)
        document_keyword_counts = Counter(document_keywords)
        document_tfidf = {}
        for keyword, count in document_keyword_counts.items():
            tf = count
            idf = self.keyword_idf.get(keyword, 0)
            document_tfidf[keyword] = tf * idf
        documents_tfidf.append(document_tfidf)

    # Calculate cosine similarities
    def cosine_similarity(vec1, vec2):
        intersection = set(vec1.keys()) & set(vec2.keys())
        numerator = sum(vec1[x] * vec2[x] for x in intersection)

        sum1 = sum(vec1[x] ** 2 for x in vec1)
        sum2 = sum(vec2[x] ** 2 for x in vec2)
        denominator = math.sqrt(sum1) * math.sqrt(sum2)

        if not denominator:
            return 0.0
        else:
            return float(numerator) / denominator

    similarities = []
    for document_tfidf in documents_tfidf:
        similarity = cosine_similarity(query_tfidf, document_tfidf)
        similarities.append(similarity)

    return similarities

This code ensures that the cosine similarity is recalculated for each document based on the current query and document TF-IDF vectors, ignoring any pre-existing scores ^[1]^[2].
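The TF-IDF cosine step in the snippet above can be exercised in isolation. The following is a minimal sketch, assuming keywords have already been extracted (so JiebaKeywordTableHandler is not needed) and using a hypothetical `idf` dictionary in place of `self.keyword_idf`; the helper names are illustrative only.

```python
import math
from collections import Counter


def tfidf_vector(keywords: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Weight each keyword by term count times IDF; unknown keywords get weight 0."""
    counts = Counter(keywords)
    return {keyword: count * idf.get(keyword, 0) for keyword, count in counts.items()}


def cosine_similarity(vec1: dict[str, float], vec2: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[k] * vec2[k] for k in shared)
    denominator = math.sqrt(sum(v * v for v in vec1.values())) * math.sqrt(
        sum(v * v for v in vec2.values())
    )
    # A zero denominator means at least one vector has no weighted keywords.
    return numerator / denominator if denominator else 0.0


# Hypothetical IDF table; real values would come from the indexed corpus.
idf = {"rerank": 2.0, "score": 1.0}

query_vec = tfidf_vector(["rerank", "score"], idf)
same = cosine_similarity(query_vec, tfidf_vector(["rerank", "score"], idf))
disjoint = cosine_similarity(query_vec, tfidf_vector(["embedding"], idf))
```

Here `same` comes out at approximately 1.0 (up to floating-point rounding) and `disjoint` is 0.0, because the only keyword in the second document has no IDF entry and shares nothing with the query.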


Apr 27 '25 13:04 dosubot[bot]