ragflow fix(mineru): use cached img_path in crop() to consume generated_images; manual.py tag parsing patched (other parsers unchanged)

What problem does this PR solve?

MinerU (vlm-http-client mode) does not generate images for pure text blocks. The fallback _generate_missing_images creates these images but they are never consumed because _transfer_to_sections only returns (text, line_tag) tuples, discarding img_path. When tokenize_chunks calls crop(), it re-crops from page_images and concatenates multiple positions into super-tall images.

Solution

Bypass _transfer_to_sections by caching line_tag -> img_path mappings and checking them in crop():

Add _img_path_cache dict in __init__
Populate cache in _generate_missing_images after generating fallback images
In crop(), extract position tags from text and check cache first
If cached image exists, return it directly (no concatenation)
Fallback to single-position cropping to avoid super-tall merged images
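
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the class name `CachedCropper`, the helper `_recrop_single`, the block dictionaries, and the tag regex are assumptions made for the example. Only `_img_path_cache`, `_generate_missing_images`, and `crop()` are names taken from the description.

```python
import re

# Assumed tag shape: "@@<page>\t<n>\t<n>\t<n>\t<n>##" (tab-delimited).
TAG_RE = re.compile(r"@@[0-9-]+\t[0-9.\t]+##")


class CachedCropper:
    """Hypothetical stand-in for the MinerU parser, reduced to the cache logic."""

    def __init__(self):
        # line_tag -> path of the fallback image generated for that block
        self._img_path_cache = {}

    def _generate_missing_images(self, blocks):
        # The real method renders images; this sketch only records the mapping.
        for block in blocks:
            self._img_path_cache[block["line_tag"]] = block["img_path"]

    def _recrop_single(self, tag):
        # Placeholder for cropping exactly one position from page_images.
        return f"recropped:{tag}"

    def crop(self, text):
        tags = TAG_RE.findall(text)
        if not tags:
            return None
        # 1) Prefer an image that was already generated for one of the tags.
        for tag in tags:
            if tag in self._img_path_cache:
                return self._img_path_cache[tag]
        # 2) Otherwise crop only the first position, never a concatenation.
        return self._recrop_single(tags[0])
```

Keying the cache on the tag string means `crop()` needs nothing beyond the chunk text, which is why `_transfer_to_sections` can stay unchanged.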

Changes

deepdoc/parser/mineru_parser.py: +43 lines, -9 lines

Testing

Parse a PDF with MinerU that has text-only pages. Verify:

generated_images/ directory contains per-block images
Chunk previews show correct thumbnails (not super-tall merged images)

Type of change

[ ✅] Bug Fix (non-breaking change which fixes an issue)
[ ] New Feature (non-breaking change which adds functionality)
[ ] Documentation Update
[ ] Refactoring
[ ] Performance Improvement
[ ] Other (please describe):

Dec 09 '25 12:12 shaoqing404

The current implementation of the crop function, along with the MinerU merging strategy, does not fully align with MinerU’s actual design assumptions. In particular, it does not adequately account for the integration paths required to properly utilize mineru-api and mineru-vllm. I am evaluating a follow-up update to include in this PR, as the present fix still leaves room for potential inconsistencies—for example, cases where images returned from the mineru-api service cannot be consumed correctly.

If necessary, please consider a more fundamental refactoring of the crop logic to ensure long-term correctness and compatibility across MinerU’s components.

Dec 09 '25 15:12 shaoqing404

Appreciated! There are Chinese comments.

Dec 10 '25 03:12 KevinHuSh

In the latest commit, I refactored the crop method in mineru_parser. The new implementation now produces output compatible with the mineru mode. Three major changes were made:

Fallback page-width image generation

A page-width fallback image is now generated for mineru text blocks based on their box coordinates. The new Crop pipeline uses this fallback image and normalizes height by page width, reducing fragmentation in mineru’s original text reconstruction and improving downstream stitching quality.

Native Image Mapping

A mapping from tag → native_img_path is introduced to supply corresponding native images for text blocks. Example:

{
  "@@1\t100.0\t500.0\t400.0\t800.0##": "/path/auto/tables/table_0.jpg",
  "@@1\t100.0\t500.0\t900.0\t1200.0##": "/path/auto/images/image_1.jpg"
}
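
A lookup against such a mapping could look like the sketch below. The helper names `make_tag` and `find_native_image` are hypothetical, and the field order (page, x0, x1, top, bottom) is an assumption; the description only shows the finished tag strings.

```python
def make_tag(page, x0, x1, top, bottom):
    """Format a position as a tab-delimited tag like the keys shown above.

    The field order (page, x0, x1, top, bottom) is an assumption.
    """
    return f"@@{page}\t{x0:.1f}\t{x1:.1f}\t{top:.1f}\t{bottom:.1f}##"


# Same entries as the example mapping above.
native_img_map = {
    "@@1\t100.0\t500.0\t400.0\t800.0##": "/path/auto/tables/table_0.jpg",
    "@@1\t100.0\t500.0\t900.0\t1200.0##": "/path/auto/images/image_1.jpg",
}


def find_native_image(position, mapping):
    """Return the native image path for a position, or None if absent."""
    return mapping.get(make_tag(*position))
```

Because the key is an exact string, the lookup only hits when the coordinates are formatted the same way on both sides (one decimal place in this sketch).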

Crop refactoring

The new flow is shown below:

Note: My apologies to the team. Some Chinese text slipped through in this submission. I've been quite drained lately and honestly didn't have the energy to clean it up thoroughly. If anything is unclear, please feel free to use a translator. Thanks for your patience.

graph TD
    A[crop called] --> B[Extract all positions]
    B --> C{Iterate over each position}
    C --> D[Build tag]
    D --> E{Tag already processed?}
    E -->|Yes| F[Skip - dedupe]
    E -->|No| G[Mark in seen_tags]

    G --> H{Look up native image}
    H -->|Found| I[Append to list: native]
    H -->|Not found| J{Look up cached image}

    J -->|Found| K[Append to list: cached]
    J -->|Not found| L{page_images available?}

    L -->|Yes| M[Append to list: fullpage]
    L -->|No| N[Skip this position]

    I --> O{All positions processed?}
    K --> O
    M --> O
    N --> O
    F --> O

    O -->|Yes| P[Smart stitching]
    O -->|No| C

    P --> Q{Image count > 10?}
    Q -->|Yes| R[Uniformly sample down to 10]
    Q -->|No| S[Keep original count]

    R --> T{Cumulative height > 2000px?}
    S --> T

    T -->|Yes| U[Truncate to 2000px]
    T -->|No| V[Keep all]

    U --> W[Vertical stitching - GAP=6px]
    V --> W

    W --> X[Return thumbnail]

The crop flow is rebuilt as: construct tags → deduplicate → perform intelligent stitching using three priority levels (native image → page-width strip → full-page image). The stitching behavior is controlled by:

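MAX_COUNT: upper bound on the number of strip images kept after sampling (10 in the flow above)
MAX_HEIGHT: maximum allowed total height, to avoid overly tall outputs (2000 px in the flow above)
GAP: vertical spacing between stitched images (6 px in the flow above)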
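
The deduplication and priority steps of the flow described above can be sketched as below. It is a simplified illustration under assumptions: the function name `collect_images` and its parameters are hypothetical, and it returns source labels instead of loading any images.

```python
def collect_images(tags, native_map, cache_map, page_images_available):
    """Return (tag, source) pairs in input order, one per unique tag."""
    seen_tags = set()
    picked = []
    for tag in tags:
        if tag in seen_tags:
            continue  # duplicate position, already handled
        seen_tags.add(tag)
        if tag in native_map:
            picked.append((tag, "native"))    # priority 1: native image
        elif tag in cache_map:
            picked.append((tag, "cached"))    # priority 2: generated strip
        elif page_images_available:
            picked.append((tag, "fullpage"))  # priority 3: crop from the page
        # Otherwise nothing can be shown for this position, so it is skipped.
    return picked
```

The result keeps the order of the input positions, so the stitched thumbnail would follow the reading order of the chunk.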

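Sampling strategy example: from 13 images, sample down to 10 by picking evenly spaced indices (illustrative; the exact picks depend on the rounding used):

sampled = [
    images[0],   # first
    images[1],
    images[3],
    images[4],
    images[5],
    images[7],
    images[8],
    images[9],
    images[11],
    images[12]   # last
]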
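
One way to implement even sampling that always keeps the first and last image is sketched below. The function name `sample_evenly` and the round-half-up index rule are assumptions for illustration; the PR may round differently.

```python
def sample_evenly(images, max_count=10):
    """Pick at most max_count items, evenly spaced, keeping first and last."""
    n = len(images)
    if n <= max_count:
        return list(images)
    if max_count == 1:
        return [images[0]]
    last = max_count - 1
    # Integer round-half-up of i * (n - 1) / last, avoiding float rounding.
    indices = [(2 * i * (n - 1) + last) // (2 * last) for i in range(max_count)]
    return [images[i] for i in indices]
```

For 13 inputs and `max_count=10` this picks indices 0, 1, 3, 4, 5, 7, 8, 9, 11, 12. Since the spacing is greater than one whenever sampling is needed, no index is picked twice.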

Truncation strategy: stop accumulating once the total would exceed 2000 px:

current_height = 0
for img in images:
    if current_height + img.height > 2000:
        break
    current_height += img.height + 6
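
The same truncation rule can be turned into paste offsets for vertical stitching. The sketch below works on plain heights so that it runs without an imaging library; the function name `layout_strip` and the choice to drop the trailing gap from the canvas height are assumptions.

```python
def layout_strip(heights, max_height=2000, gap=6):
    """Return (offsets, canvas_height) for the images that fit.

    Uses the same rule as the truncation loop: an image is dropped, together
    with everything after it, as soon as adding it would exceed max_height.
    """
    offsets = []
    current = 0
    for h in heights:
        if current + h > max_height:
            break
        offsets.append(current)  # y coordinate where this image is pasted
        current += h + gap
    # Drop the trailing gap so the canvas ends flush with the last image.
    canvas_height = current - gap if offsets else 0
    return offsets, canvas_height
```

Note that with this rule a first image taller than `max_height` yields an empty layout, so a caller may want to handle that case separately, for example by resizing that single image.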

MinerU compatibility fix in manual.py (This fix is effective only in manual.py; other parsers remain unchanged.)

File: rag/app/manual.py Lines: 233–268

Issue: MinerU may generate tags using spaces as separators (e.g., @@3 127.0 430.0), while the original code only handled tab-delimited formats (\t). When parsing fails, the logic falls back to the default tuple:

(page, 0.0, 0.0, 0.0, 0.0)

This caused all chunk thumbnails to incorrectly display as if they came from page 1.

Fix: Added a MinerU-specific fallback parser. If the standard parser fails, the logic now attempts a more permissive \s+ pattern that matches both tabs and spaces.

Effect: Correctly preserves MinerU's bbox coordinates, ensuring thumbnails display the right page and region.
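
A two-stage parser of this kind might look like the sketch below. It is not the code from rag/app/manual.py: the names `STRICT_TAG`, `LOOSE_TAG`, and `parse_tag` are hypothetical, the sketch handles single page numbers only, and the sample coordinates in the usage line are made up.

```python
import re

NUM = r"(\d+(?:\.\d+)?)"
# Strict form: fields separated by tabs only.
STRICT_TAG = re.compile(r"@@(\d+)" + (r"\t" + NUM) * 4 + "##")
# Permissive form: any run of whitespace (tabs or spaces) between fields.
LOOSE_TAG = re.compile(r"@@(\d+)" + (r"\s+" + NUM) * 4 + "##")


def parse_tag(tag, default_page=1):
    """Return (page, n1, n2, n3, n4), or the default tuple if nothing matches."""
    m = STRICT_TAG.search(tag) or LOOSE_TAG.search(tag)
    if m is None:
        return (default_page, 0.0, 0.0, 0.0, 0.0)
    page, *coords = m.groups()
    return (int(page), *(float(c) for c in coords))
```

For example, `parse_tag("@@3 127.0 430.0 50.0 90.0##")` and its tab-delimited equivalent both give `(3, 127.0, 430.0, 50.0, 90.0)`. The permissive pattern alone would accept both forms; keeping the strict attempt first mirrors the "standard parser, then fallback" order described above.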

Note: This adjustment is applied only to manual.py. Other custom parsers (e.g., tcadp, docling) still assume their original tag formats and may require similar handling if they encounter space-delimited tags.

Dec 10 '25 15:12 shaoqing404

KevinHuSh @KevinHuSh

Dec 10 '25 15:12 shaoqing404

Hi, @shaoqing404

Do you have any screenshots to show your results? I can see there is incorrect position mapping for the pipeline backend.
And this does not happen without your modification.

Confirming our version: mineru 2.6.6

The test results under the vlm-http-client mode based on mineru-api are as follows:

Before:

Dec 11 '25 13:12 shaoqing404

Hi, @shaoqing404 Do you have any screenshots to show your results? I can see there is incorrect position mapping for the pipeline backend. And this does not happen without your modification.

At the moment, the issue fixed in this PR only applies to the manual mode. Let me take a look at the annotation drifting problem you mentioned.

Dec 11 '25 13:12 shaoqing404