Is the current integration of the embedding model and the reranking model not quite optimal?

Open listeng opened this issue 1 year ago • 1 comments

Self Checks

[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.5.11

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Configure the embedding model and the reranking model, and add some documents to the knowledge base.
Ask a question.

✔️ Expected Behavior

Retrieve 20 (or more) documents using the embedding model, pass them to the reranker, and return the top_k = 3 documents from the reranker's result.

❌ Actual Behavior

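Only top_k = 3 documents are retrieved using the embedding model, and those same three documents are passed to the reranker. Reranking can then only reorder them; it cannot bring in a more relevant document that the embedding search ranked fourth or lower. Sorting just these three documents is largely meaningless, isn't it?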
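
To make the difference concrete, here is a small self-contained sketch. It does not use Dify's code: the function retrieve_then_rerank and all the scores are made up for illustration, and plain dictionaries stand in for the embedding model and the reranker.

```python
def retrieve_then_rerank(embed_scores, rerank_scores, fetch_k, top_n):
    """Fetch fetch_k candidates by embedding score, then keep the top_n by reranker score."""
    candidates = sorted(embed_scores, key=embed_scores.get, reverse=True)[:fetch_k]
    reranked = sorted(candidates, key=rerank_scores.get, reverse=True)
    return reranked[:top_n]


# Made-up scores for five documents. "d" and "e" look weak to the embedding
# model, but the (hypothetical) reranker considers them the best matches.
embed_scores = {"a": 0.90, "b": 0.85, "c": 0.80, "d": 0.75, "e": 0.70}
rerank_scores = {"a": 0.20, "b": 0.60, "c": 0.10, "d": 0.95, "e": 0.90}

# Actual behavior: the candidate pool is already cut to top_k = 3.
current = retrieve_then_rerank(embed_scores, rerank_scores, fetch_k=3, top_n=3)
# Expected behavior: a larger pool is reranked, then cut to 3.
expected = retrieve_then_rerank(embed_scores, rerank_scores, fetch_k=5, top_n=3)

print(current)   # ['b', 'a', 'c']
print(expected)  # ['d', 'e', 'b']
```

With fetch_k equal to top_n, the returned set is always the embedding model's top 3; only the order changes. With a larger pool, the reranker can surface "d" and "e".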

In api/core/rag/datasource/retrieval_service.py:

documents = vector.search_by_vector(
    query,
    search_type='similarity_score_threshold',
    top_k=top_k,  # <<<<<<<< this already limits the results
    score_threshold=score_threshold,
    filter={
        'group_id': [dataset.id]
    }
)

Maybe we could add an if condition:

# Fetch a larger candidate pool only when a reranking model is configured.
top_k1 = top_k
if reranking_model:
    top_k1 = 20
documents = vector.search_by_vector(
    query,
    search_type='similarity_score_threshold',
    top_k=top_k1,  # <<<<<< have it return more results
    score_threshold=score_threshold,
    filter={
        'group_id': [dataset.id]
    }
)
...
...
data_post_processor = DataPostProcessor(str(dataset.tenant_id), reranking_model, False)
all_documents.extend(data_post_processor.invoke(
    query=query,
    documents=documents,
    score_threshold=score_threshold,
    top_n=top_k  # <<<<<<<<< this is the real top_k
))
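
One caveat with a fixed value of 20: if a user sets top_k above 20, the pool would become smaller than the number of results requested. A possible refinement is sketched below. The helper candidate_count and its rerank_candidates parameter are hypothetical names for illustration, not part of Dify.

```python
def candidate_count(top_k, has_reranker, rerank_candidates=20):
    """Return how many documents to request from the vector store.

    Hypothetical helper: rerank_candidates is an assumed default pool size.
    """
    if not has_reranker:
        # Without a reranker there is nothing to gain from over-fetching.
        return top_k
    # Never fetch fewer than top_k, even when top_k exceeds the default pool.
    return max(top_k, rerank_candidates)


print(candidate_count(3, has_reranker=False))  # 3
print(candidate_count(3, has_reranker=True))   # 20
print(candidate_count(50, has_reranker=True))  # 50
```

The result would be passed as top_k to the vector search, while the original top_k is still used as top_n for the reranker.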

Would this optimization be feasible? Thank you.

Apr 01 '24 02:04 listeng

That sounds like a great proposal! Your suggested solution seems like a feasible optimization to address the issue you've identified. We encourage you to go ahead and create a pull request with your proposed changes. Your contribution will be valuable to the project. Thank you for your interest in improving Dify!

Apr 01 '24 02:04 dosubot[bot]