
[Bug]: Inconsistent page_size Behavior in retrieval Function When doc_ids is Provided

Open RyzeAngler opened this issue 7 months ago • 1 comment

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

cc8029a7

RAGFlow image version

3a34def55f161

Other environment information


Actual behavior

In the retrieval function of the Dealer class in rag/nlp/search.py (starting at line 348), the page_size parameter is forcibly overridden whenever the doc_ids parameter is provided. This breaks parameter consistency and can lead to unexpected behavior. Specifically:

```python
if doc_ids:
    similarity_threshold = 0
    page_size = 30
```

When doc_ids is truthy, the local page_size variable is unconditionally reassigned to 30, regardless of the value the caller passed in.

Problem analysis:

1. Parameter semantic inconsistency: The retrieval function explicitly accepts a page_size parameter, which implies that the caller can control the number of results returned. In the doc_ids scenario, however, the forced override silently discards the caller's value, making the parameter's semantics unclear and the function's behavior unpredictable.

2. Potential computational waste: The calculation of idx (which selects chunk indices from the reranked results) on line 382 uses the original page_size passed by the caller:

```python
idx = np.argsort(sim * -1)[(page - 1) * page_size:page * page_size]
```

If the caller passed page_size=100, for example, the system selects indices for 100 candidate chunks. However, the subsequent loop's length check and the final truncation use the value that was forcibly set to 30 (when doc_ids is truthy). More chunks are therefore selected than are ultimately returned, wasting computation on chunks that are discarded.

3. Limited flexibility: The hardcoded limit prevents callers from controlling the number of returned chunks when doc_ids is specified. A caller might need fewer than 30 chunks (e.g., a few representative snippets from specific documents) or more than 30 (e.g., all relevant chunks from a set of specified documents).
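The mismatch described above can be sketched as follows. This is a simplified model of the control flow, not the actual ragflow code; the function name `paginate_then_truncate` and the `forced_cap` parameter are hypothetical, introduced only to make the two different page-size values explicit.

```python
import numpy as np

def paginate_then_truncate(sim, page, page_size, forced_cap):
    # Select a page of chunk indices by descending similarity, using the
    # caller's page_size (as line 382 does) ...
    idx = np.argsort(sim * -1)[(page - 1) * page_size:page * page_size]
    # ... but then cap the result at the forcibly overridden value (30),
    # discarding the extra work done to rank the unused candidates.
    return idx[:forced_cap]

sim = np.random.rand(500)  # reranked similarity scores for 500 chunks
selected = paginate_then_truncate(sim, page=1, page_size=100, forced_cap=30)
print(len(selected))  # 30, even though 100 candidates were selected
```

Here 100 indices are computed and sliced, but only 30 survive the truncation, which is the wasted work the analysis points at.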

Expected behavior

To address these issues and improve the function's behavior, I propose two possible solutions:

1. Remove the forced page_size = 30 override: the most straightforward fix is to delete the `page_size = 30` assignment (line 390) from rag/nlp/search.py.

Steps to reproduce

...

Additional information

No response

RyzeAngler, Jun 12 '25 10:06

I found that this code was merged to solve #6228. I don't quite understand why a problem about being unable to parse uploaded images should be fixed by limiting page_size. https://github.com/infiniflow/ragflow/issues/6228

RyzeAngler, Jun 12 '25 10:06

Since there has been no further activity for over three weeks, we will proceed to close this issue. If the problem persists or you have additional questions, please feel free to reopen the issue or create a new one. We’re happy to assist anytime.

Magicbook1108, Dec 17 '25 07:12