
[Bug]: Retrieving from the knowledge base via the Python API returns different results than the built-in retrieval test, and the rerank_id parameter has no effect

Open lhxxrds opened this issue 1 month ago • 9 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly ( 非英文标题的提交将会被直接关闭 ) (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

RAGFlow workspace code commit ID

56789

RAGFlow image version

v0.22.1

Other environment information

Retrieving from the knowledge base via the Python API returns different results than the built-in retrieval test, and the rerank_id parameter has no effect.

Actual behavior

Retrieving from the knowledge base via the Python API returns different results than the built-in retrieval test, and the rerank_id parameter has no effect.

Expected behavior

Retrieving from the knowledge base via the Python API should return the same results as the built-in retrieval test, and the rerank_id parameter should take effect.

Steps to reproduce

Retrieve from the knowledge base via the Python API and compare against the built-in retrieval test; the results differ, and the rerank_id parameter has no effect.

Additional information

Retrieving from the knowledge base via the Python API returns different results than the built-in retrieval test, and the rerank_id parameter has no effect.

lhxxrds · Nov 27 '25 05:11

There are known issues in RAGFlow v0.22.1 that can cause the Python API to return different retrieval results compared to direct retrieval (such as via the UI or /retrieval_test endpoint). This is often due to differences in how pagination and reranking interact, especially with the RERANK_LIMIT setting, which can lead to inconsistent total counts and returned records depending on the page_size you use. This bug was fixed in later/nightly versions, so upgrading is recommended if possible [details].
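As a rough repro sketch, assuming the ragflow_sdk client shown later in this thread (the key, URL, dataset ID, and rerank model name below are placeholders), you can check whether page_size alone changes what comes back:

```python
from ragflow_sdk import RAGFlow

rag = RAGFlow(api_key="<tenant_api_key>", base_url="<ragflow_url>")

# On affected v0.22.1 builds, the same query with reranking enabled may return
# inconsistent results as page_size varies.
for size in (5, 10, 30):
    chunks = rag.retrieve(
        question="<your test question>",
        dataset_ids=["<dataset_id>"],
        page=1,
        page_size=size,
        rerank_id="<registered_rerank_model>",
    )
    print(f"page_size={size}: {len(list(chunks))} chunks returned")
```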

For the rerank_id parameter, it must exactly match the name of a reranker model registered for your tenant, including any factory/vendor suffix (for example, reranker@VLLM). If the value doesn't match what's registered, you'll get an invalid rerank id error. You can check your registered rerank models via the admin UI, SDK, or API. Most errors here are due to mismatches between the provided value and the actual registered model name [details].

Other factors that can cause discrepancies include differences in session handling, timeout/network configuration, or parameter defaults (like similarity_threshold, vector_similarity_weight, or top_k). Make sure your API and direct retrieval requests use the same parameters and environment settings [details].
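For example, here is a minimal sketch that pins these parameters explicitly on the SDK side, reusing the rag client from the sketch above; the values are the defaults mentioned in this thread, not verified UI defaults:

```python
chunks = rag.retrieve(
    question="<your test question>",
    dataset_ids=["<dataset_id>"],
    similarity_threshold=0.2,       # set to whatever the UI retrieval test uses
    vector_similarity_weight=0.3,   # SDK default per this thread; confirm the UI value
    top_k=1024,                     # candidate pool size before reranking
    page=1,
    page_size=10,
)
```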

If you need more targeted troubleshooting, please clarify what you mean by "directly retrieving the test" (e.g., UI, CLI, or another API endpoint) and share any specific error messages you see with the rerank_id parameter.

To reply, just mention @dosu.



dosubot[bot] · Nov 27 '25 05:11

There is a significant difference between the results returned by the retrieval test in the knowledge base and those returned by retrieving through the Python API. I have also made the modifications you suggested, but the results are still the same. Here is my code; if there are any problems, please point them out:

```python
from ragflow_sdk import RAGFlow

# Initialize the RAGFlow client
rag = RAGFlow(api_key="*****8", base_url="")

# "What is the secondary price cap in the Shanxi spot market?"
question = "山西现货市场二次限价是多少?"

chunks = rag.retrieve(
    question=question,
    dataset_ids=["98e806d0cb3011f0b7878ac63072bd56"],
    similarity_threshold=0.2,
    top_k=1024,
    page=1,
    keyword=True,
    page_size=10,
    rerank_id="reranker@VLLM",
)

# Print the retrieval results
for chunk in chunks:
    print(chunk)
    # print(f"Content: {chunk.content}")
    # print("-" * 50)
```

lhxxrds · Nov 27 '25 05:11

Your Python code is generally correct, but the differences in retrieval results and rerank_id issues are mostly due to backend bugs and parameter handling in RAGFlow v0.22.1.

  1. Rerank ID Format: The rerank_id must exactly match the registered model name in the backend, including vendor suffixes. For VLLM models, you may need a format like bge-m3@VLLM or bge-m3___VLLM rather than reranker@VLLM. If the rerank_id does not match, reranking is skipped or you get an error, even if the model appears in the UI dropdown. Double-check the backend configuration or use browser dev tools to inspect the exact rerank_id the UI sends (see the probe sketch after this list) [details].

  2. Parameter Alignment: Make sure all parameters (similarity_threshold, vector_similarity_weight, top_k, keyword, page_size, etc.) match between your SDK call and the UI retrieval test. Even small differences can cause result discrepancies. The SDK uses vector_similarity_weight=0.3 by default, while the UI may use different weights or defaults [details].

  3. Version Bugs: RAGFlow v0.22.1 has known bugs in pagination, reranking, and metadata filtering that cause inconsistent results between API and UI retrieval. These were fixed in later versions, so upgrading is recommended for full consistency [details, PR #11412].

  4. Metadata Filtering: If you use metadata_condition for filtering, it is unreliable in v0.22.1 due to backend bugs. Filtering only works if the metadata field is present in every document's meta_fields, and even then, results may not be consistent. Tags are the only reliable workaround in this version [details].

  5. Retrieval Variability: Retrieval results can vary due to similarity thresholds, keyword/vector weights, and reranking randomness. Lowering the similarity threshold or increasing top_k can help make results more consistent [details].
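As a hedged probe sketch, reusing the rag client and question from the code above (the candidate model names come from point 1 and are assumptions, not verified registrations):

```python
candidates = ["reranker@VLLM", "bge-m3@VLLM", "bge-m3___VLLM"]
dataset_ids = ["98e806d0cb3011f0b7878ac63072bd56"]

# Baseline without reranking, then each candidate rerank_id with otherwise
# identical parameters; a candidate that never changes the ranking (and never
# errors) was probably skipped as unrecognized.
baseline = rag.retrieve(question=question, dataset_ids=dataset_ids,
                        similarity_threshold=0.2, top_k=1024, page=1, page_size=10)
base_order = [c.content for c in baseline]
for rid in candidates:
    try:
        reranked = rag.retrieve(question=question, dataset_ids=dataset_ids,
                                similarity_threshold=0.2, top_k=1024,
                                page=1, page_size=10, rerank_id=rid)
        print(rid, "changed ranking:", [c.content for c in reranked] != base_order)
    except Exception as exc:  # an unregistered rerank_id may raise instead
        print(rid, "error:", exc)
```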

If you still see discrepancies, try capturing the full request payloads from both the SDK and UI (using browser dev tools) and compare them for hidden differences. If you get specific error messages with rerank_id, share them for more targeted troubleshooting.

To reply, just mention @dosu.



dosubot[bot] · Nov 27 '25 05:11

In RAGFlow, when using the Python API to access different knowledge bases, is it necessary to apply for a separate API key for each knowledge base?

lhxxrds · Nov 27 '25 06:11

You do not need separate API keys for each knowledge base in RAGFlow. API keys are scoped at the tenant (account/workspace) level, so a single API key gives you access to all knowledge bases owned by that tenant. To switch between knowledge bases in your Python API calls, just change the dataset_ids parameter; no new API key is needed. You would only need a different API key to access knowledge bases under a different tenant (account) than your current one.
[Reference]
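A minimal sketch, assuming the same ragflow_sdk client pattern as in your code above (the key, URL, and dataset IDs are placeholders):

```python
from ragflow_sdk import RAGFlow

# One tenant-level key serves every knowledge base under that tenant.
rag = RAGFlow(api_key="<tenant_api_key>", base_url="<ragflow_url>")

question = "<your test question>"
chunks_a = rag.retrieve(question=question, dataset_ids=["<dataset_id_A>"])
chunks_b = rag.retrieve(question=question, dataset_ids=["<dataset_id_B>"])
```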

To reply, just mention @dosu.



dosubot[bot] · Nov 27 '25 06:11

A very serious problem: I found that the rerank_id parameter in the Python API only takes effect for the first knowledge base I created. For the second, third, and all subsequently created knowledge bases, it is ignored.

lhxxrds · Nov 27 '25 06:11

This is a backend limitation in RAGFlow v0.22.1: when you use the Python API and specify multiple knowledge bases, the rerank_id is only validated and applied using the tenant and registry of the first knowledge base in your list. If your other knowledge bases are under different tenants or have different rerank model registrations, rerank_id will be invalid for them. This is not a documented feature, but rather an implementation detail of how the backend instantiates rerank models: it uses only the first knowledge base's tenant context for the rerank_id lookup and model creation. There is no explicit restriction in the code, but this logic causes the behavior you observed [see code reference].

To avoid this, make sure all knowledge bases in your request belong to the same tenant and have the rerank model registered for that tenant. If you need to use rerank_id across different tenants or knowledge bases with different configurations, you'll need to make separate API calls for each one, as sketched below. This limitation is present in v0.22.1 and there is no patch for it in that version; recent bug fixes do not address this specific issue.
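As a workaround sketch, reusing the rag client and question from your code above (the dataset IDs and rerank model name are placeholders):

```python
# Query each knowledge base in its own request so the rerank model is resolved
# against that knowledge base's tenant context rather than the first entry's.
results = {}
for ds_id in ["<dataset_id_1>", "<dataset_id_2>", "<dataset_id_3>"]:
    results[ds_id] = rag.retrieve(
        question=question,
        dataset_ids=[ds_id],       # one knowledge base per call
        rerank_id="bge-m3@VLLM",   # hypothetical name; use your registered model
    )
```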

To reply, just mention @dosu.



dosubot[bot] · Nov 27 '25 06:11

You misunderstood me above. I created four knowledge bases under my personal account and access all of them with the same API key via the Python API. What I found is that only the first knowledge base (by creation time) honors the rerank_id parameter. Even after I tried other approaches to keep everything else identical, the parameter still has no effect on the others.

lhxxrds · Nov 27 '25 06:11

The rerank_id parameter is not tied to the creation order of your knowledge bases. Instead, it's validated using the tenant (your account) associated with the first knowledge base in your API request. All knowledge bases under the same tenant should support the same rerank_id, as long as the rerank model is registered for your tenant and the rerank_id matches exactly (including any required suffix, like @VLLM) [backend code reference].

If rerank_id only works for your first-created knowledge base, but not for others in the same account, this usually points to a registration or configuration mismatch; there is no backend logic that treats the first-created knowledge base differently. Double-check that all your knowledge bases are under the same tenant, and that the rerank model is registered and active for your tenant. You can verify the exact rerank_id by listing your registered models via the admin UI or API [details].

If everything matches and the issue persists, it may be a bug in v0.22.1. Upgrading to a newer version is recommended, as several retrieval and rerank bugs have been fixed in later releases. To troubleshoot further, try creating a new knowledge base and testing rerank_id against it (a quick check is sketched below), or share any specific error messages you see.
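A quick check sketch, reusing the rag client and question from your earlier code; it assumes chunks expose a content attribute, as in the commented-out print in your snippet, and uses a hypothetical rerank model name:

```python
# Run the same query with and without rerank_id against one knowledge base;
# if the two rankings are identical, reranking was silently skipped.
baseline = rag.retrieve(question=question, dataset_ids=["<dataset_id>"])
reranked = rag.retrieve(question=question, dataset_ids=["<dataset_id>"],
                        rerank_id="bge-m3@VLLM")  # hypothetical registered name
print("rerank changed order:",
      [c.content for c in baseline] != [c.content for c in reranked])
```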

To reply, just mention @dosu.



dosubot[bot] · Nov 27 '25 06:11

What's the difference between the Python API and "directly retrieving the test"? What do you mean by "directly retrieving the test"? Please provide screenshots.

Magicbook1108 · Nov 28 '25 08:11