
[Question]: How to implement ReferenceExtractor component in Ingestion Pipeline

Open aaronchen2k opened this issue 2 months ago • 15 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly ( 非英文标题的提交将会被直接关闭 ) (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

I want to develop a ReferenceExtractor component in the Ingestion Pipeline (ragflow v0.22) to extract references from each section, as below:

import re

def extract_section_refs(text: str):
    # Match common cross-reference formats
    patterns = [
        r'参见[第]?(\d+\.\d+(?:\.\d+)?)节',
        r'详见[第]?(\d+\.\d+(?:\.\d+)?)节',
        r'见(\d+\.\d+(?:\.\d+)?)',
        r'如(\d+\.\d+(?:\.\d+)?)所述'
    ]
    refs = []
    for pat in patterns:
        refs.extend(re.findall(pat, text))
    return list(set(refs))  # deduplicate

class ReferenceExtractor(BaseComponent):
    def run(self, **kwargs):
        chunks = kwargs["chunks"]
        # Build a section_id -> chunk map (must be provided by an upstream node)
        section_map = build_section_index(chunks)

        enriched_chunks = []
        for chunk in chunks:
            refs = extract_section_refs(chunk["text"])
            chunk["metadata"]["internal_refs"] = refs  # key step: store the reference targets

            # Optional: pre-merge referenced content (use with care to avoid oversized chunks)
            # ref_texts = [section_map.get(r, {}).get("text", "") for r in refs]
            # chunk["text"] += "\n[Referenced content]: " + "\n".join(ref_texts)

            enriched_chunks.append(chunk)
        return {"chunks": enriched_chunks}

aaronchen2k • Nov 16 '25 06:11

To implement a ReferenceExtractor component in the RAGFlow ingestion pipeline, create a new component that inherits from ComponentBase (the base class for pipeline components). Your main logic should go in the _invoke method, not run. Use regex patterns to extract section references from each chunk's text, and store the results in the chunk's metadata under an internal_refs field. There is no built-in build_section_index function, so you'll need to implement a utility that maps section identifiers (like section numbers) to chunk objects, typically using chunk metadata or content as keys.

Here's a simplified outline:

  1. Define a parameter class inheriting from ComponentParamBase for any configurable options.
  2. Implement your ReferenceExtractor class inheriting from ComponentBase. Put your extraction logic in the _invoke method (see the sketch after this list).
  3. For each chunk, extract references using your regex patterns and update chunk["metadata"]["internal_refs"].
  4. Optionally, enrich chunk text with referenced section content, but be careful not to make chunks too large.
  5. Place your component in agent/component/ for auto-discovery.
  6. Implement the thoughts method to return a string summarizing the component's reasoning or status (required for all components); see an existing component for an example and the requirements.
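For orientation, here is a minimal skeleton following the outline above. It is a sketch, not a drop-in implementation: the import path, the exact ComponentBase/ComponentParamBase method signatures, and the self._param attribute name are assumptions that may differ slightly in your RAGFlow version, so check them against an existing component in agent/component/.

import re

# Assumed import path; verify against an existing component in agent/component/.
from agent.component.base import ComponentBase, ComponentParamBase


def extract_section_refs(text: str) -> list:
    # Same idea as the regex helper earlier in this thread, trimmed to one pattern.
    return list(set(re.findall(r'参见[第]?(\d+\.\d+(?:\.\d+)?)节', text)))


class ReferenceExtractorParam(ComponentParamBase):
    def __init__(self):
        super().__init__()
        self.field_name = "internal_refs"  # metadata key to store extracted refs under

    def check(self):
        self.check_empty(self.field_name, "Result field name")


class ReferenceExtractor(ComponentBase):
    def _invoke(self, **kwargs):
        chunks = kwargs.get("chunks", [])
        for chunk in chunks:
            refs = extract_section_refs(chunk.get("text", ""))
            chunk.setdefault("metadata", {})[self._param.field_name] = refs
        self.set_output("chunks", chunks)

    def thoughts(self) -> str:
        return "Extracted internal section references from each chunk."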

Chunk metadata can be set programmatically during ingestion, but keep in mind that chunk-level metadata is not persisted in the main database by default—only document-level metadata is. If you need persistent chunk-level metadata, you'll need to extend the storage layer (see details).

If you want to integrate this component into the UI, add it to the Operator enum, componentMenuList, and FormConfigMap in the frontend source files, and map it to a form component for UI representation. After backend and frontend changes, restart/rebuild both to load updates.

Let me know if you want a concrete code template or more details on any step!


dosubot[bot] • Nov 16 '25 06:11

@dosu you said "add it to the Operator enum, componentMenuList, and FormConfigMap in the frontend source files, and map it to a form component for UI representation." Which files do I need to edit?

aaronchen2k • Nov 17 '25 10:11

Here are the frontend files you need to edit to add your ReferenceExtractor component:

  • Operator enum: Add your new operator to web/src/constants/agent.tsx (see source).
  • FormConfigMap: Map your operator to its React form component in web/src/pages/agent/form-sheet/form-config-map.tsx (see source).
  • Operator selection menu (componentMenuList): Add your operator to the relevant lists in web/src/pages/agent/canvas/node/dropdown/accordion-operators.tsx and web/src/pages/agent/canvas/node/dropdown/operator-item-list.tsx (see source).

This will make your new component selectable and configurable in the UI. If you want custom icons or translations, you may also want to update the relevant icon and i18n files.


dosubot[bot] • Nov 17 '25 10:11

@dosu thank you! I created an agent flow node called ReferenceExtractor with no parameters. How can I get the chunks from the previous node, like {HierarchicalMerger:KindSwansFall@chunks} does in the user prompt (用户提示词)?

aaronchen2k • Nov 18 '25 02:11

You can get the chunks from the previous node in your ReferenceExtractor by accessing the input argument in your component's _invoke (or run) method. When the upstream node outputs {"chunks": ...}, RAGFlow passes this as a keyword argument to your node. So, in your ReferenceExtractor, just use:

def _invoke(self, **kwargs):
    chunks = kwargs["chunks"]
    # your logic here

Alternatively, if you use self.get_input("chunks"), it will also retrieve the chunks input, as long as the output key from the previous node is mapped to the input key of your node in the flow configuration.

The {HierarchicalMerger:KindSwansFall@chunks} syntax in the user prompt (用户提示词) is resolved by the backend to pass the correct output to your node. Just make sure the output key from the previous node is "chunks" and your ReferenceExtractor expects "chunks" as input; then you don't need any parameters in your node.

If the mapping is missing or the keys don't match, kwargs["chunks"] will raise a KeyError, so double-check your flow connections in the UI.
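A hedged sketch of the defensive pattern combining both access styles (it assumes get_input and set_output behave as described in this thread):

def _invoke(self, **kwargs):
    # Prefer the keyword argument; fall back to get_input() in case the
    # upstream output was wired through this node's input mapping instead.
    chunks = kwargs.get("chunks") or self.get_input("chunks")
    if not chunks:
        raise ValueError("ReferenceExtractor received no 'chunks' from the upstream node")
    # ... reference extraction / enrichment goes here ...
    self.set_output("chunks", chunks)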


dosubot[bot] • Nov 18 '25 02:11

@dosu I got an error "Extra inputs are not permitted [type=extra_forbidden, input_value=" when I used self.set_output("section_num_to_content_map", section_num_to_content_map) in a pipeline node.

How can I pass variables between Ingestion Pipeline flow nodes?

aaronchen2k • Nov 19 '25 07:11

The error happens because RAGFlow uses strict Pydantic models for pipeline node input/output schemas, and by default, these models set extra="forbid". That means any output key you add (like section_num_to_content_map) that isn't explicitly declared in the schema for the downstream node will trigger the "Extra inputs are not permitted" error.

To pass custom variables between nodes, you need to add your new key to the Pydantic schema of the downstream node that will receive it. For example, if your next node expects section_num_to_content_map, edit its schema (e.g., rag/flow/extractor/schema.py or rag/flow/hierarchical_merger/schema.py) and add:

section_num_to_content_map: dict | None = Field(default=None)

If you want to allow arbitrary extra fields, you could set extra="allow" in the model_config, but this is not recommended for production pipelines because it weakens validation.

In short: declare your custom output fields in the relevant schema(s) for the downstream node(s) to pass variables between pipeline nodes without hitting the extra_forbidden error. This is the standard pattern for extending data flow in RAGFlow pipelines.
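As a rough sketch of that schema change (the actual schema file, model name, and existing fields depend on the downstream node; only the added field is the point here):

from pydantic import BaseModel, ConfigDict, Field


class DownstreamNodeFromUpstream(BaseModel):
    # Hypothetical model standing in for the downstream node's input schema.
    model_config = ConfigDict(extra="forbid")  # the strict default described above

    chunks: list | None = Field(default=None)
    # Declare the custom key so the upstream node can pass it without
    # triggering the extra_forbidden error.
    section_num_to_content_map: dict | None = Field(default=None)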


dosubot[bot] • Nov 19 '25 07:11

@dosu I added a node called ReferenceExtractor in the Ingestion Pipeline flow. After running this flow, the dataflow-result page does not show any data for this new ReferenceExtractor node. Why?

aaronchen2k • Nov 19 '25 07:11

The dataflow-result page only displays data for nodes whose type matches specific values (like parser, characterSplitter, titleSplitter, contextGenerator) and whose outputs follow a strict schema: the output must be under an outputs object, with an output_format key indicating which output to show (e.g., "chunks", "json", "text") and the actual data under that key. If your ReferenceExtractor node isn't mapped to a recognized type or doesn't structure its outputs this way, the UI won't render its results—even if the backend runs fine.

To fix this, make sure:

  1. Your node's output is structured as outputs with an output_format key and a corresponding value (e.g., outputs={"output_format": {"value": "chunks"}, "chunks": {"value": [...], "type": "chunks"}}).
  2. The node type is mapped in the frontend's TimelineNodeType enum and included in the rendering logic for the dataflow-result page.
  3. If you want to display custom metadata (like internal_refs), ensure it's included in the chunk objects and that the frontend is set up to render it.

If you skip any of these, the ReferenceExtractor node will remain invisible in the dataflow-result view. For more details, see how the UI expects and renders node outputs here and here.
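As a hedged illustration of step 1, the backend side might look roughly like this (assuming set_output records each key under the node's outputs object in the {"value": ...} shape shown above):

def _invoke(self, **kwargs):
    chunks = kwargs.get("chunks", [])
    enriched = self._extract_refs(chunks)  # hypothetical helper that adds internal_refs
    # Tell the dataflow-result page which output to render, then provide it.
    self.set_output("output_format", "chunks")
    self.set_output("chunks", enriched)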


dosubot[bot] • Nov 19 '25 07:11

@dosu

In web/src/pages/dataflow-result/components/parse-editer, I use the code below to print content.params.field_name:

const parserKey = useMemo(() => { console.log('------', content.params.field_name)

Where can I set the value of "content.params.field_name" for an Ingestion Pipeline flow custom node in the backend Python code?

aaronchen2k • Nov 19 '25 08:11

You set content.params.field_name in the backend by defining field_name as an attribute in your node’s Param class (for example, ReferenceExtractorParam). When you build your pipeline DSL or instantiate the node, assign a value to field_name in the params dictionary. The backend serializes this param object and sends it to the frontend, so any attribute you set (like field_name) will appear as content.params.field_name in the UI.

Example:

class ReferenceExtractorParam(ComponentParamBase):
    def __init__(self):
        super().__init__()
        self.field_name = ""  # Add your custom param here

    def check(self):
        self.check_empty(self.field_name, "Result Destination")

Then, when constructing your pipeline (DSL or Python), set the value:

params = {"field_name": "internal_refs"}
param_obj = ReferenceExtractorParam()
param_obj.update(params)
# Pass param_obj to your ReferenceExtractor node

This pattern is used in built-in nodes like Extractor as well—see how ExtractorParam.field_name is set and used in the backend code (reference). The frontend reads the serialized param object, so your field_name value will be available as content.params.field_name.


dosubot[bot] • Nov 19 '25 08:11

@dosu

I set the value of field_name to "text" as below:

class ReferenceExtractorParam(ProcessParamBase, LLMParam):
    def __init__(self):
        super().__init__()
        self.field_name = "text"

    def check(self):
        super().check()
        self.check_empty(self.field_name, "Result Destination")

But on the frontend page (web/src/pages/dataflow-result/components/parse-editer/json-parser.tsx), content.params.field_name is 'summary'. Why?

const parserKey = useMemo(() => { console.log('------', content.params.field_name)

aaronchen2k • Nov 19 '25 09:11

@dosu how does ragflow v0.22 use a chunk's 'questions' metadata during retrieval?

aaronchen2k • Nov 22 '25 05:11

RAGFlow v0.22 uses chunk 'questions' metadata in several ways during retrieval:

  • During ingestion, if a chunk has a 'questions' field, it gets split into 'question_kwd' (keywords) and tokenized into 'question_tks'. These fields are indexed and used for retrieval and ranking—'questions' itself is deleted before indexing, but its tokenized forms remain in the chunk metadata (source, source).

  • When embedding chunks, the system prefers the 'questions' field for semantic embedding if present, otherwise it falls back to 'summary' or 'text' (source).

  • During retrieval, the search and rerank logic gives high weight to matches on 'question_kwd' and 'question_tks', so chunks with relevant questions are more likely to be retrieved and ranked higher (source, source).

  • You do not need to manually configure anything for these fields to participate—they are used automatically in search and ranking. However, filtering or retrieval by custom chunk metadata (including 'questions') is not supported out of the box; chunk-level metadata is not persisted in the main database unless you extend the storage layer (source).

In short: 'questions' metadata is transformed and heavily used for retrieval relevance and ranking, but direct filtering by custom chunk metadata requires additional backend changes.
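As a simplified, hedged illustration of the ingestion-side transformation described above (the field names come from this thread, but the tokenize callable is a stand-in for RAGFlow's internal tokenizer, and the sketch assumes 'questions' arrives as a newline-separated string):

def expand_questions(chunk: dict, tokenize) -> dict:
    # Split the raw 'questions' text into keyword entries, tokenize them for
    # full-text matching, and drop the raw field before indexing.
    raw = chunk.pop("questions", "") or ""
    questions = [q.strip() for q in raw.split("\n") if q.strip()]
    if questions:
        chunk["question_kwd"] = questions
        chunk["question_tks"] = tokenize("\n".join(questions))
    return chunk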


dosubot[bot] • Nov 22 '25 05:11

Have you solved this?

Magicbook1108 • Nov 27 '25 02:11