[Question]: DeepDoc + “Parse as Paper” Generates Only OCR Text; Missing LLM-Based Graph Descriptions (Works in General Parsing)

Open nikhilmgeorge7 opened this issue 1 month ago • 5 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Describe your problem

When using DeepDoc with the “Parse as Paper” option, the output generated in the graph view contains only OCR-extracted text from the PDF. However, no semantic / LLM-generated description is created for the graph nodes.

In contrast, when the exact same file is processed using “Parse as General”, the system successfully generates:

  • OCR extraction
  • an LLM-generated description / summary for each graph node

This means that Paper parsing mode appears to disable or skip the LLM description step, resulting in graph nodes that are incomplete and less useful for retrieval or downstream reasoning.

At the moment, “Parse as Paper” = OCR only, while “Parse as General” = OCR + LLM description.

This behavior is unexpected because Paper mode is typically used for structured scientific documents, where semantic descriptions are even more important.

Steps to Reproduce

  1. Upload any PDF with structured content (e.g., a research paper).
  2. Choose DeepDoc as the parser.
  3. Select “Parse as Paper.”
  4. After parsing completes, open the Graph View.
  5. Observe that:
     • Each node contains only OCR text chunks.
     • The field where a semantic description normally appears is missing or empty.
  6. Re-upload or re-parse the same file using “Parse as General.”
  7. In Graph View, each node now includes not only OCR text but also the LLM-generated description, as expected.
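The comparison in these steps can also be scripted once the graph nodes from each run have been exported. The following is a minimal sketch, assuming each node is available as a plain dictionary; the field names `id` and `description` are hypothetical placeholders and are not taken from RAGFlow's actual schema.

```python
from typing import Dict, Iterable, List


def nodes_missing_description(
    nodes: Iterable[Dict[str, str]], field: str = "description"
) -> List[str]:
    """Return the ids of nodes whose description field is absent or blank.

    The "id" key and the default field name are assumptions for illustration.
    """
    missing = []
    for node in nodes:
        value = node.get(field)
        if value is None or not value.strip():
            missing.append(node.get("id", "<unknown>"))
    return missing


if __name__ == "__main__":
    # Hypothetical exports from the two parsing runs of the same PDF.
    paper_nodes = [
        {"id": "n1", "text": "OCR text only"},
        {"id": "n2", "text": "OCR text only", "description": "  "},
    ]
    general_nodes = [
        {"id": "n1", "text": "OCR text", "description": "LLM summary"},
        {"id": "n2", "text": "OCR text", "description": "LLM summary"},
    ]
    print("paper:", nodes_missing_description(paper_nodes))
    print("general:", nodes_missing_description(general_nodes))
```

Running the same check against both exports makes the difference between the two modes easy to attach to a report.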

Expected Behavior

DeepDoc + Parse as Paper should produce both:

  • OCR-extracted raw text
  • LLM-based description / summarization, exactly like General parsing, unless there is a setting explicitly disabling it.

The graph should contain a semantic description field that enhances retrieval quality and document understanding.

Actual Behavior

Parse as Paper output:

  • Only plain OCR text in graph nodes
  • No LLM summary or description generated
  • Graph nodes appear incomplete compared to the General mode

Parse as General output:

  • OCR text + full LLM description
  • Graph nodes are enriched and more usable

This issue affects:

  • Document comprehension
  • RAG quality (embeddings rely heavily on semantic descriptions)
  • Accuracy of downstream QA
  • Graph-based retrieval usefulness

Especially for research papers, losing LLM descriptions significantly reduces the value of DeepDoc’s processing.

nikhilmgeorge7 • Nov 24 '25 13:11

DeepDoc's "Parse as Paper" mode only outputs OCR text (without LLM-based semantic descriptions for graph nodes) because of its default parser configuration: semantic enrichment steps like RAPTOR and GraphRAG are disabled by default in this mode. In contrast, "Parse as General" mode enables these steps, so you get both OCR and LLM-generated summaries in graph nodes.

To enable LLM-based descriptions in "Parse as Paper" mode, you need to update the parser configuration for your dataset or document to set use_raptor: true (and optionally use_graphrag: true). This activates the semantic enrichment pipeline, so graph nodes will include LLM-generated summaries just like in General mode. This change is not currently exposed in the UI—you'll need to update the config via the API or database directly. See the HTTP API reference for details on the relevant fields.

Enabling these flags will improve document comprehension, retrieval quality, and downstream QA by enriching graph nodes with semantic context, which is especially valuable for structured scientific PDFs.


dosubot[bot] • Nov 24 '25 13:11

Parse as Paper Ignores use_raptor: true for LLM Descriptions

Despite setting use_raptor: true in the parser configuration, the DeepDoc "Parse as Paper" mode still fails to generate LLM-based semantic descriptions/summaries for graph nodes. This leaves only the raw OCR text, which significantly reduces the utility of "Paper" parsing.

The behavior is correct when using "chunk_method": "general". The following code was executed before initiating the parsing process:

target_doc.update({
    "chunk_method": "paper",
    "parser_config": {
        "use_raptor": True,   # Explicitly set to enable LLM summarization
        "use_graphrag": True  # Also enabled
    }
})
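One way to narrow this down is to check whether the flags were actually stored before parsing started, which separates "the update was dropped" from "the parser ignored the stored value". The following is a minimal sketch under stated assumptions: the payload uses the flat keys exactly as posted above, the helper names are hypothetical, and how the stored configuration is read back (and whether the keys are expected flat or nested) depends on the RAGFlow version and is not shown here.

```python
from typing import Any, Dict


def build_paper_update(
    use_raptor: bool = True, use_graphrag: bool = True
) -> Dict[str, Any]:
    """Build the update payload as posted in this comment (flat keys)."""
    return {
        "chunk_method": "paper",
        "parser_config": {
            "use_raptor": use_raptor,
            "use_graphrag": use_graphrag,
        },
    }


def missing_flags(requested: Dict[str, Any], stored: Dict[str, Any]) -> Dict[str, Any]:
    """Return the requested entries that the stored config does not reflect."""
    return {key: value for key, value in requested.items() if stored.get(key) != value}


if __name__ == "__main__":
    payload = build_paper_update()
    # Hypothetical config read back from the server after the update call.
    stored_config = {"use_raptor": False}
    print(missing_flags(payload["parser_config"], stored_config))
```

An empty result would suggest the configuration was saved as requested and that the flags are being ignored at parse time instead.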

nikhilmgeorge7 • Nov 25 '25 13:11

This is a design limitation: DeepDoc’s “Parse as Paper” mode does not trigger LLM-based semantic enrichment (RAPTOR/GraphRAG) during standard chunking, even if you set use_raptor: true in the parser config. The code path for “Parse as Paper” only performs OCR and structural parsing; it never calls the semantic enrichment steps unless you explicitly run a RAPTOR or GraphRAG task after parsing. The only way to get LLM summaries for graph nodes in “Paper” mode is to manually trigger these enrichment tasks via the API or task runner after the initial parse. For reference, the parser config for “paper” mode sets use_raptor and use_graphrag to false by default, and the chunking logic ignores these flags unless a graph enrichment task is run (see source).

If you want seamless LLM enrichment in “Parse as Paper,” you’ll need to adjust your workflow to include a RAPTOR or GraphRAG step after parsing. This isn’t a bug, but it’s not well-documented—so it’s worth flagging for future improvements.


dosubot[bot] • Nov 25 '25 13:11

What's your running version or commit ID?

Magicbook1108 • Nov 26 '25 02:11

> What's your running version or commit ID?

v0.20.5 slim

nikhilmgeorge7 • Nov 26 '25 05:11

Please raise a feature request for this issue. The product management team will evaluate its priority.

Magicbook1108 • Dec 15 '25 07:12