
[Question]: The evaluation of ragflow's search performance using the benchmark.py file failed.

Open Monkeycrop opened this issue 1 month ago • 14 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

I'm currently trying to evaluate the retrieval performance of ragflow. I've completed the following steps:

  1. Downloaded the ms_marco_v1.1 dataset

  2. Used the benchmark.py file to evaluate the ms_marco_v1.1 benchmark dataset, but encountered an error:

python3 rag/benchmark.py 10000 db287ed0d4ae11f0b0ee6e0da93083d1 ms_marco_v1.1 ./datasets/ms_marco_v1.1

/ragflow/.venv/lib/python3.10/site-packages/google/api_core/_python_version_support.py:266: FutureWarning: You are using a Python version (3.10.12) which Google will stop supporting in new releases of google.api_core once it reaches its end of life (2026-10-04). Please upgrade to the latest Python version, or at least Python 3.11, to continue receiving updates for google.api_core past that date.
  warnings.warn(message, FutureWarning)
WARNING:root:SECURITY WARNING: Using auto-generated SECRET_KEY.
WARNING:root:Load term.freq FAIL!
WARNING:root:Realtime synonym is disabled, since no redis connection.
WARNING:root:Load term.freq FAIL!
WARNING:root:Realtime synonym is disabled, since no redis connection.
*****************RAGFlow Benchmark*****************
Tokenizing:msmarco_passage.parquet: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/ragflow/rag/benchmark.py", line 306, in <module>
    ex(dataset, dataset_path)
  File "/ragflow/rag/benchmark.py", line 253, in __call__
    print(dataset, evaluate(Qrels(qrels), Run(run), ["ndcg@10", "map@5", "mrr@10"]))
  File "/ragflow/.venv/lib/python3.10/site-packages/ranx/data_structures/qrels.py", line 76, in __init__
    max_len = max(len(y) for x in doc_ids for y in x)
ValueError: max() arg is an empty sequence

Could you tell me how to fix this? I only want to evaluate the search performance of ragflow using these metrics. @Magicbook1108

Monkeycrop · Dec 09 '25 03:12

The error happens because the qrels dictionary (which holds the relevance judgments for evaluation) is empty—this usually means the dataset files are missing, empty, or not in the expected format. The benchmark.py script builds qrels directly from the ms_marco_v1.1 parquet files in your dataset directory. These files must contain the columns: query, passages.is_selected, and passages.passage_text. If these are missing or misnamed, qrels will be empty and you'll get the ValueError you saw.

To fix this, check that your ./datasets/ms_marco_v1.1 directory contains the correct parquet files and that each file has the required structure. You can also inspect the dumped .qrels.json and .run.json files in that directory to debug what was actually loaded. Ensuring the dataset matches the expected format should resolve the error and allow you to evaluate metrics like NDCG@10, MAP@5, and MRR@10.
See benchmark.py code reference
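
If it helps, here is a rough inspection sketch (the dataset path and the dumped file names are assumptions based on the description above; adjust both to your setup) that prints the parquet columns and, if present, the dumped qrels/run files:

import json
import os

import pandas as pd

dataset_dir = "./datasets/ms_marco_v1.1"  # assumed path; point this at your dataset directory

# Print the columns of every parquet file so you can confirm "query" and "passages" exist.
for fn in os.listdir(dataset_dir):
    if fn.endswith(".parquet"):
        df = pd.read_parquet(os.path.join(dataset_dir, fn))
        print(fn, "rows:", len(df), "columns:", list(df.columns))

# If the benchmark already ran once, peek at the dumped judgment/run files (names assumed).
for fn in ("ms_marco_v1.1.qrels.json", "ms_marco_v1.1.run.json"):
    path = os.path.join(dataset_dir, fn)
    if os.path.exists(path):
        with open(path) as f:
            print(fn, "queries:", len(json.load(f)))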


dosubot[bot] · Dec 09 '25 03:12

I haven't worked with benchmark.py myself, so you may need to wait for my colleague to share more insight and give you a more helpful answer.

Magicbook1108 · Dec 09 '25 05:12

How about this:

python3 rag/benchmark.py --max_docs=10000 --kb_id=db287ed0d4ae11f0b0ee6e0da93083d1 --dataset=ms_marco_v1.1 --dataset_path=./datasets/ms_marco_v1.1

KevinHuSh · Dec 09 '25 07:12

How about this:

python3 rag/benchmark.py --max_docs=10000 --kb_id=db287ed0d4ae11f0b0ee6e0da93083d1 --dataset=ms_marco_v1.1 --dataset_path=./datasets/ms_marco_v1.1

This syntax is not supported. Looking at my previous error messages, I suspect there is a problem with the dataset, or that the dataset I downloaded is missing data.

root@huyarag-ragflow-9dwjc:/ragflow# python3 rag/benchmark.py --max_docs=10000 --kb_id=db287ed0d4ae11f0b0ee6e0da93083d1 --dataset=ms_marco_v1.1 --dataset_path=./datasets/ms_marco_v1.1
/ragflow/.venv/lib/python3.10/site-packages/google/api_core/_python_version_support.py:266: FutureWarning: You are using a Python version (3.10.12) which Google will stop supporting in new releases of google.api_core once it reaches its end of life (2026-10-04). Please upgrade to the latest Python version, or at least Python 3.11, to continue receiving updates for google.api_core past that date.
  warnings.warn(message, FutureWarning)
WARNING:root:SECURITY WARNING: Using auto-generated SECRET_KEY.
WARNING:root:Load term.freq FAIL!
WARNING:root:Realtime synonym is disabled, since no redis connection.
WARNING:root:Load term.freq FAIL!
WARNING:root:Realtime synonym is disabled, since no redis connection.
*****************RAGFlow Benchmark*****************
usage: benchmark.py <max_docs> <kb_id> <dataset> <dataset_path> [<miracl_corpus_path>])
benchmark.py: error: the following arguments are required: max_docs, kb_id, dataset, dataset_path

Monkeycrop · Dec 10 '25 03:12

I'm having trouble using the benchmark function; it keeps throwing errors. @KevinHuSh @dosu

root@huyarag-ragflow-9dwjc:/ragflow# python3 rag/benchmark.py 10000 df997ba2d66311f0a6c46e0da93083d1 ms_marco_v1.1 ./dataset
/ragflow/.venv/lib/python3.10/site-packages/google/api_core/_python_version_support.py:266: FutureWarning: You are using a Python version (3.10.12) which Google will stop supporting in new releases of google.api_core once it reaches its end of life (2026-10-04). Please upgrade to the latest Python version, or at least Python 3.11, to continue receiving updates for google.api_core past that date.
  warnings.warn(message, FutureWarning)
*****************RAGFlow Benchmark*****************
ERROR:root:TenantLLMService.split_model_name_and_factory got exception: 'NoneType' object is not iterable
Traceback (most recent call last):
  File "/ragflow/api/db/services/tenant_llm_service.py", line 79, in split_model_name_and_factory
    model_providers = set([f["name"] for f in model_factories])
TypeError: 'NoneType' object is not iterable
ERROR:root:TenantLLMService.split_model_name_and_factory got exception: 'NoneType' object is not iterable
Traceback (most recent call last):
  File "/ragflow/api/db/services/tenant_llm_service.py", line 79, in split_model_name_and_factory
    model_providers = set([f["name"] for f in model_factories])
TypeError: 'NoneType' object is not iterable
ERROR:root:TenantLLMService.split_model_name_and_factory got exception: 'NoneType' object is not iterable
Traceback (most recent call last):
  File "/ragflow/api/db/services/tenant_llm_service.py", line 79, in split_model_name_and_factory
    model_providers = set([f["name"] for f in model_factories])
TypeError: 'NoneType' object is not iterable
Traceback (most recent call last):
  File "/ragflow/rag/benchmark.py", line 294, in <module>
    ex = Benchmark(kb_id)
  File "/ragflow/rag/benchmark.py", line 44, in __init__
    self.embd_mdl = LLMBundle(self.kb.tenant_id, LLMType.EMBEDDING, llm_name=self.kb.embd_id, lang=self.kb.language)
  File "/ragflow/api/db/services/llm_service.py", line 84, in __init__
    super().__init__(tenant_id, llm_type, llm_name, lang, **kwargs)
  File "/ragflow/api/db/services/tenant_llm_service.py", line 252, in __init__
    self.mdl = TenantLLMService.model_instance(tenant_id, llm_type, llm_name, lang=lang, **kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3128, in inner
    return fn(*args, **kwargs)
  File "/ragflow/api/db/services/tenant_llm_service.py", line 132, in model_instance
    model_config = TenantLLMService.get_model_config(tenant_id, llm_type, llm_name)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3128, in inner
    return fn(*args, **kwargs)
  File "/ragflow/api/db/services/tenant_llm_service.py", line 120, in get_model_config
    raise LookupError(f"Model({mdlnm}@{fid}) not authorized")
LookupError: Model(text-embedding-3-small@OpenAI@None) not authorized

Monkeycrop · Dec 11 '25 07:12

Your errors are happening because the benchmark script isn't initializing the global model settings before it runs. To fix this, add the following at the very top of rag/benchmark.py (before any other imports or logic):

from common import settings
settings.init_settings()

This will load your model configuration from conf/llm_factories.json and prevent the 'NoneType' and 'Model not authorized' errors you’re seeing. Make sure conf/llm_factories.json exists and includes the 'OpenAI' factory with 'text-embedding-3-small' listed as an embedding model. Also, double-check that your knowledge base (kb_id) is set up to use an authorized embedding model for your tenant.
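
As a quick sanity check, a sketch along these lines can confirm the factory entry is there (the file path and JSON keys are assumptions about how llm_factories.json is laid out; adjust them to what your copy actually contains):

import json

# Assumed location and rough structure of the factory config described above;
# adjust the path and key names to match your installation.
with open("conf/llm_factories.json") as f:
    factories = json.load(f)

for factory in factories.get("factory_llm_infos", []):
    if factory.get("name") == "OpenAI":
        names = [llm.get("llm_name") for llm in factory.get("llm", [])]
        print("OpenAI models found:", names)
        print("text-embedding-3-small present:", "text-embedding-3-small" in names)
        break
else:
    print("No 'OpenAI' factory entry found in conf/llm_factories.json")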

After making these changes, re-run the benchmark script. If you still get errors, check your knowledge base and model settings for mismatches or missing authorizations, especially if there have been recent changes to model switching logic in your setup.


dosubot[bot] · Dec 11 '25 07:12

Now I'm back to the initial error message. Could you please provide a solution? I want to evaluate the retrieval performance of ragflow. @dosu

root@huyarag-ragflow-9dwjc:/ragflow# python3 rag/benchmark.py 10000 df997ba2d66311f0a6c46e0da93083d1 ms_marco_v1.1 ./dataset
/ragflow/.venv/lib/python3.10/site-packages/google/api_core/_python_version_support.py:266: FutureWarning: You are using a Python version (3.10.12) which Google will stop supporting in new releases of google.api_core once it reaches its end of life (2026-10-04). Please upgrade to the latest Python version, or at least Python 3.11, to continue receiving updates for google.api_core past that date.
  warnings.warn(message, FutureWarning)
WARNING:root:SECURITY WARNING: Using auto-generated SECRET_KEY.
WARNING:root:Load term.freq FAIL!
WARNING:root:Realtime synonym is disabled, since no redis connection.
WARNING:root:Load term.freq FAIL!
WARNING:root:Realtime synonym is disabled, since no redis connection.
*****************RAGFlow Benchmark*****************
Tokenizing:msmarco_passage.parquet: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/ragflow/rag/benchmark.py", line 302, in <module>
    ex(dataset, dataset_path)
  File "/ragflow/rag/benchmark.py", line 249, in __call__
    print(dataset, evaluate(Qrels(qrels), Run(run), ["ndcg@10", "map@5", "mrr@10"]))
  File "/ragflow/.venv/lib/python3.10/site-packages/ranx/data_structures/qrels.py", line 76, in __init__
    max_len = max(len(y) for x in doc_ids for y in x)
ValueError: max() arg is an empty sequence

Monkeycrop · Dec 11 '25 07:12

The error happens because your dataset files are missing data or are not in the format RAGFlow expects. Each row in your Parquet files must have a non-empty "query" string and a "passages" dictionary with two lists: "is_selected" (relevance labels) and "passage_text" (passages), both non-empty and of equal length. If any row is missing these, qrels will be empty and you'll get this ValueError during evaluation.

You can check and clean your dataset with this script (update the path as needed):

import os

import pandas as pd

dataset_dir = "./datasets/ms_marco_v1.1"  # update to your dataset path


def valid(row):
    # Keep a row only if "passages" has matching, non-empty "is_selected" and
    # "passage_text" sequences. Parquet round-trips often return numpy arrays
    # instead of Python lists, so check by length rather than isinstance(list).
    p = row["passages"]
    try:
        return (
            "is_selected" in p
            and "passage_text" in p
            and len(p["is_selected"]) == len(p["passage_text"]) > 0
        )
    except TypeError:
        return False


for fn in os.listdir(dataset_dir):
    if not fn.endswith(".parquet"):
        continue
    df = pd.read_parquet(os.path.join(dataset_dir, fn))
    cleaned = df[df.apply(valid, axis=1)]
    print(f"{fn}: {len(df)} rows, {len(cleaned)} valid rows")
    cleaned.to_parquet(os.path.join(dataset_dir, fn + ".cleaned"))

After cleaning, re-run the benchmark using the .cleaned Parquet files. If you still get the error, your dataset may be incomplete or not in the required schema. In that case, you can generate a valid dataset using the official MS MARCO v1.1 validation set and qrels with this script:
Full conversion script and explanation

Once your dataset is valid, the benchmark will run and produce metrics like NDCG@10, MAP@5, and MRR@10 without this error.


dosubot[bot] · Dec 11 '25 07:12

This method didn't work; I've already tried it, and I'm still getting this error. @dosu

Monkeycrop · Dec 11 '25 09:12

This error is happening because the qrels structure is still empty or contains queries with no documents when you reach the evaluation step. This can happen even after cleaning if the dataset is missing passages, or if the code doesn't defensively skip queries with empty or mismatched passages lists. The current benchmark.py does not check for this, so you need to add a defensive block right before evaluation.

Try adding this code just before the call to evaluate(Qrels(qrels), ...) in benchmark.py:

# Remove queries with empty qrels before evaluation
qrels = {q: docs for q, docs in qrels.items() if len(docs) > 0}
if not qrels:
    raise ValueError("No valid qrels found. Check your dataset formatting and content.")

This will ensure only queries with at least one document are passed to ranx, and will give you a clearer error if the dataset is still empty.

If you still get the error, please check:

  • The number of rows in your cleaned Parquet files.
  • The contents of the <dataset>.qrels.json and <dataset>.run.json files (if they exist).
  • A sample row from your cleaned Parquet file, including the full structure of the query and passages fields.

If your dataset is empty after cleaning, you may need to regenerate it using the official MS MARCO validation set and qrels, as described in this script and explanation.
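
To gather those checks in one place, here is a rough inspection sketch (the dataset path and dumped file names are assumptions; adjust them to your directory):

import glob
import json
import os

import pandas as pd

dataset_dir = "./datasets/ms_marco_v1.1"  # assumed path

# Row counts of the cleaned parquet files, plus one sample row per file.
for path in glob.glob(os.path.join(dataset_dir, "*.parquet.cleaned")):
    df = pd.read_parquet(path)
    print(os.path.basename(path), "rows:", len(df))
    if len(df):
        sample = df.iloc[0]
        print("  sample query:", sample["query"])
        print("  sample passages:", sample["passages"])

# Dumped qrels/run files, if the benchmark got far enough to write them (names assumed).
for fn in ("ms_marco_v1.1.qrels.json", "ms_marco_v1.1.run.json"):
    path = os.path.join(dataset_dir, fn)
    if os.path.exists(path):
        with open(path) as f:
            print(fn, "queries:", len(json.load(f)))
    else:
        print(fn, "not found")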

Let me know what you find after these steps!


dosubot[bot] · Dec 11 '25 09:12

An error was indeed reported. @dosu

What is this code doing? I don't have a tenant_id of benchmark_ms_marco_v11.

self.tenant_id = "benchmark_ms_marco_v11"
self.index_name = search.index_name(self.tenant_id)
qrels, texts = self.ms_marco_index(file_path, "benchmark_ms_marco_v1.1")

if dataset == "ms_marco_v1.1":
    self.tenant_id = "benchmark_ms_marco_v11"
    self.index_name = search.index_name(self.tenant_id)
    qrels, texts = self.ms_marco_index(file_path, "benchmark_ms_marco_v1.1")
    # Remove queries with empty qrels before evaluation
    qrels = {q: docs for q, docs in qrels.items() if len(docs) > 0}
    if not qrels:
        raise ValueError("No valid qrels found. Check your dataset formatting and content.")
    run = self._get_retrieval(qrels)
    print(dataset, evaluate(Qrels(qrels), Run(run), ["ndcg@10", "map@5", "mrr@10"]))
    self.save_results(qrels, run, texts, dataset, file_path)

(screenshot attached)

Monkeycrop · Dec 11 '25 10:12

The root cause is that your dataset contains queries with no valid passages, so the qrels structure is empty or has entries with zero documents—this triggers the ValueError during evaluation. The ms_marco_index method in benchmark.py does not skip rows with empty or malformed passages, so even after cleaning, invalid entries can slip through and break evaluation (source).

To fix this, update benchmark.py so that ms_marco_index only adds to qrels if both 'is_selected' and 'passage_text' are non-empty lists of the same length, and skip any query where the zipped pairs are empty. Also, before evaluation, filter out any qrels entries with zero docs:

# After building qrels in ms_marco_index, add:
qrels = {q: docs for q, docs in qrels.items() if len(docs) > 0}
if not qrels:
    raise ValueError("No valid qrels found. Check your dataset formatting and content.")

If you want, share a sample row from your Parquet file (with 'query' and 'passages') or the contents of your .qrels.json file, and I can help pinpoint the formatting issue. If your dataset is still empty after these checks, you may need to regenerate it using the official MS MARCO validation set and qrels in the required format.


dosubot[bot] · Dec 11 '25 10:12

I found that the dataset I downloaded is empty. Could you please provide me with a correct script to download the ms_marco_v1.1 dataset? @dosu

Monkeycrop · Dec 11 '25 10:12

You can download the MS MARCO v1.1 dataset directly from Hugging Face and convert it to the Parquet format expected by RAGFlow's benchmark.py using the following script. This will create Parquet files where each row has a "query" string and a "passages" dictionary with "is_selected" and "passage_text" lists of equal length.

First, install the required dependencies:

pip install datasets pandas pyarrow

Then run this script (adjust the output path as needed):

from datasets import load_dataset
import pandas as pd
import os

# Download the validation split (commonly used for benchmarking)
ds = load_dataset("microsoft/ms_marco", "v1.1", split="validation")

# Prepare output directory
os.makedirs("./datasets/ms_marco_v1.1", exist_ok=True)

# Convert to the expected format and save as Parquet
records = []
for item in ds:
    # Each item has: 'query', 'passages', etc.
    passages = item["passages"]
    # Only keep if both lists are non-empty and of equal length
    if (
        isinstance(passages, dict)
        and "is_selected" in passages
        and "passage_text" in passages
        and isinstance(passages["is_selected"], list)
        and isinstance(passages["passage_text"], list)
        and len(passages["is_selected"]) == len(passages["passage_text"]) > 0
    ):
        records.append({
            "query": item["query"],
            "passages": {
                "is_selected": passages["is_selected"],
                "passage_text": passages["passage_text"]
            }
        })

df = pd.DataFrame(records)
df.to_parquet("./datasets/ms_marco_v1.1/msmarco_passage.parquet")

After running this, you should have a Parquet file in ./datasets/ms_marco_v1.1/ compatible with RAGFlow's benchmark.py. If you want to use a different split (like "train"), change the split argument in load_dataset.

Check the Hugging Face dataset card for any licensing or usage restrictions: https://huggingface.co/datasets/microsoft/ms_marco

If you still get empty results, double-check that the output Parquet file is not empty and contains the expected fields.
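
For that double-check, a few lines like these will do (the path matches the script above):

import pandas as pd

# Verify the parquet written above is non-empty and carries the expected fields.
df = pd.read_parquet("./datasets/ms_marco_v1.1/msmarco_passage.parquet")
print("rows:", len(df))
print("columns:", list(df.columns))  # expect: query, passages
if len(df):
    print("first passages keys:", list(df.iloc[0]["passages"].keys()))
else:
    print("file is empty!")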


dosubot[bot] · Dec 11 '25 10:12

The public dataset test was successful, but now I want to build my own dataset to evaluate retrieval performance. How should I build it, and what format should the dataset be in? @dosu

Monkeycrop · Dec 17 '25 03:12

To evaluate retrieval performance with your own dataset in RAGFlow, you need to create a Parquet file (or files) where each row has:

  • a "query" string (your question or search prompt)
  • a "passages" dictionary with two lists:
    • "is_selected": a list of 0/1 labels (1 = relevant, 0 = not relevant) for each passage
    • "passage_text": a list of passage strings (candidate answers or document chunks)

Both lists must be non-empty and of the same length for every row. This format matches the MS MARCO v1.1 and TriviaQA schemas supported by benchmark.py (reference, discussion).

Here's a minimal example for two queries:

import pandas as pd

data = [
    {
        "query": "What is the capital of France?",
        "passages": {
            "is_selected": [1, 0, 0],
            "passage_text": [
                "Paris is the capital of France.",
                "Berlin is the capital of Germany.",
                "Madrid is the capital of Spain."
            ]
        }
    },
    {
        "query": "Who wrote Hamlet?",
        "passages": {
            "is_selected": [0, 1, 0],
            "passage_text": [
                "Charles Dickens wrote Oliver Twist.",
                "William Shakespeare wrote Hamlet.",
                "J.K. Rowling wrote Harry Potter."
            ]
        }
    }
]

df = pd.DataFrame(data)
df.to_parquet("my_custom_dataset.parquet")

Place your Parquet file(s) in a directory (e.g., ./datasets/my_custom_dataset/) and run:

python3 rag/benchmark.py <max_docs> <kb_id> ms_marco_v1.1 ./datasets/my_custom_dataset

Tips:

  • For each query, include all candidate passages you want to evaluate, and set "is_selected" to 1 for relevant passages and 0 for others.
  • Avoid empty or mismatched lists—every row must have at least one passage, and both lists must be the same length.
  • You can generate queries and passages from your own data, QA pairs, or document chunks—just ensure the format matches above.

If you need to benchmark in a different schema (e.g., TriviaQA or MIRACL), the requirements are similar but with different field names or file formats (details).


dosubot[bot] · Dec 17 '25 03:12

Do I need to upload and parse the documents into the knowledge base (<kb_id>) in advance so that they can be retrieved? @dosu

Monkeycrop · Dec 17 '25 03:12

No, you do not need to upload or parse documents into the <kb_id> knowledge base in advance. When you run benchmark.py, it automatically ingests and indexes the evaluation dataset into the specified knowledge base as part of the benchmark process. The script reads your dataset files, embeds the passages, and inserts them into the knowledge base index before running the retrieval evaluation—so you only need to provide a valid <kb_id> and the dataset in the required format. Just make sure your knowledge base exists and is properly configured with the correct embedding model and permissions.
See benchmark.py code reference


dosubot[bot] · Dec 17 '25 03:12