fix(mineru): use cached img_path in crop() to consume generated_images; manual.py tag parsing patched (other parsers unchanged)
What problem does this PR solve?
MinerU (vlm-http-client mode) does not generate images for pure text blocks.
The fallback _generate_missing_images creates these images but they are never
consumed because _transfer_to_sections only returns (text, line_tag) tuples,
discarding img_path. When tokenize_chunks calls crop(), it re-crops from
page_images and concatenates multiple positions into super-tall images.
Solution
Bypass _transfer_to_sections by caching line_tag -> img_path mappings and
checking them in crop():
- Add _img_path_cache dict in __init__
- Populate cache in _generate_missing_images after generating fallback images
- In crop(), extract position tags from text and check cache first
- If cached image exists, return it directly (no concatenation)
- Fallback to single-position cropping to avoid super-tall merged images
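The caching idea in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in mineru_parser.py: the class name `CropCacheSketch`, the methods `remember` and `lookup`, and the `TAG_RE` pattern are hypothetical names introduced here, and only `_img_path_cache` comes from the description.

```python
import re
from typing import Dict, Optional

# Assumed tag shape, e.g. "@@1\t100.0\t500.0\t400.0\t800.0##".
TAG_RE = re.compile(r"@@[^#]*##")


class CropCacheSketch:
    """Hypothetical stand-in for the parser; only the caching idea is shown."""

    def __init__(self) -> None:
        # line_tag -> path of the fallback image generated for that block
        self._img_path_cache: Dict[str, str] = {}

    def remember(self, line_tag: str, img_path: str) -> None:
        """Record a generated image (would run after a fallback image is saved)."""
        self._img_path_cache[line_tag] = img_path

    def lookup(self, text: str) -> Optional[str]:
        """Return the cached path for the first known tag found in text, if any."""
        for tag in TAG_RE.findall(text):
            path = self._img_path_cache.get(tag)
            if path is not None:
                return path
        return None  # caller would fall back to single-position cropping
```

Because the lookup is keyed by the full tag string, it only hits when the tag embedded in the chunk text is byte-for-byte identical to the one stored at generation time.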
Changes
deepdoc/parser/mineru_parser.py: +43 lines, -9 lines
Testing
Parse a PDF with MinerU that has text-only pages. Verify:
- generated_images/ directory contains per-block images
- Chunk previews show correct thumbnails (not super-tall merged images)
Type of change
- [x] Bug Fix (non-breaking change which fixes an issue)
- [ ] New Feature (non-breaking change which adds functionality)
- [ ] Documentation Update
- [ ] Refactoring
- [ ] Performance Improvement
- [ ] Other (please describe):
The current implementation of the crop function, along with the MinerU merging strategy, does not fully align with MinerU’s actual design assumptions. In particular, it does not adequately account for the integration paths required to properly utilize mineru-api and mineru-vllm. I am evaluating a follow-up update to include in this PR, as the present fix still leaves room for potential inconsistencies—for example, cases where images returned from the mineru-api service cannot be consumed correctly.
If necessary, please consider a more fundamental refactoring of the crop logic to ensure long-term correctness and compatibility across MinerU’s components.
Much appreciated! Note that there are some Chinese comments.
In the latest commit, I refactored the crop method in mineru_parser. The new implementation now produces output compatible with the mineru mode. Three major changes were made:
- Fallback page-width image generation
A page-width fallback image is now generated for mineru text blocks based on their box coordinates. The new Crop pipeline uses this fallback image and normalizes height by page width, reducing fragmentation in mineru’s original text reconstruction and improving downstream stitching quality.
- Native Image Mapping
A mapping from tag → native_img_path is introduced to supply corresponding native images for text blocks. Example:
{
"@@1\t100.0\t500.0\t400.0\t800.0##": "/path/auto/tables/table_0.jpg",
"@@1\t100.0\t500.0\t900.0\t1200.0##": "/path/auto/images/image_1.jpg"
}
- Crop refactoring: the new flow is shown in the flowchart below.
Note: My apologies to the team. Some Chinese text slipped through in this submission. I’ve been quite drained lately and honestly didn’t have the energy to clean it up thoroughly. If anything is unclear, please feel free to use a translator. Thanks for your patience.
graph TD
A[crop called] --> B[Extract all positions]
B --> C{Iterate over each position}
C --> D[Build tag]
D --> E{Tag already processed?}
E -->|Yes| F[Skip: duplicate]
E -->|No| G[Mark in seen_tags]
G --> H{Look up native image}
H -->|Found| I[Append to list: native]
H -->|Not found| J{Look up cached image}
J -->|Found| K[Append to list: cached]
J -->|Not found| L{page_images available?}
L -->|Yes| M[Append to list: fullpage]
L -->|No| N[Skip this position]
I --> O{All positions processed?}
K --> O
M --> O
N --> O
F --> O
O -->|Yes| P[Smart stitching]
O -->|No| C
P --> Q{Image count > 10?}
Q -->|Yes| R[Uniformly sample down to 10]
Q -->|No| S[Keep original count]
R --> T{Cumulative height > 2000px?}
S --> T
T -->|Yes| U[Truncate to 2000px]
T -->|No| V[Keep all]
U --> W[Vertical stitching, GAP=6px]
V --> W
W --> X[Return thumbnail]
The crop flow is rebuilt as: construct tags → deduplicate → perform intelligent stitching using three priority levels (native image → page-width strip → full-page image). The stitching behavior is controlled by:
- MAX_COUNT: sampling upper bound for strip images
- MAX_HEIGHT: maximum allowed total height to avoid overly tall outputs
- GAP: vertical spacing (in pixels) between stitched images
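The deduplication and three-level priority described above could look roughly like the sketch below. It assumes the native and cached images are held in plain tag-to-path dictionaries; the function name `resolve_sources` and its parameters are hypothetical and are not taken from the PR's code.

```python
from typing import Dict, List, Tuple


def resolve_sources(
    tags: List[str],
    native_images: Dict[str, str],
    cached_images: Dict[str, str],
    page_images_available: bool,
) -> List[Tuple[str, str]]:
    """Return (tag, source_kind) pairs in input order, skipping duplicate tags."""
    seen_tags = set()
    resolved: List[Tuple[str, str]] = []
    for tag in tags:
        if tag in seen_tags:
            continue  # this position was already handled
        seen_tags.add(tag)
        if tag in native_images:
            resolved.append((tag, "native"))    # priority 1: native image
        elif tag in cached_images:
            resolved.append((tag, "cached"))    # priority 2: page-width strip
        elif page_images_available:
            resolved.append((tag, "fullpage"))  # priority 3: full-page image
        # otherwise no source exists for this position, so it is skipped
    return resolved
```

Keeping the resolution step separate from the stitching step makes it easy to check which source each position used before any image is opened.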
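Sampling strategy example: from 13 images, sample down to 10 (illustrative; evenly spaced indices with the first and last image always kept, exact picks depend on rounding):
sampled = [
images[0], # first
images[1],
images[3],
images[4],
images[5],
images[7],
images[8],
images[9],
images[11],
images[12] # last
]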
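One way to sample evenly while always keeping the first and last image is sketched below. It is an assumption about how "uniform sampling" might be implemented, not the PR's actual routine, and the helper name `sample_evenly` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_evenly(items: Sequence[T], max_count: int) -> List[T]:
    """Pick at most max_count items at evenly spaced indices, keeping both ends."""
    if max_count < 1:
        raise ValueError("max_count must be at least 1")
    n = len(items)
    if n <= max_count:
        return list(items)  # nothing to drop
    if max_count == 1:
        return [items[0]]
    # Spread max_count indices across 0..n-1 and round to the nearest integer.
    step = (n - 1) / (max_count - 1)
    return [items[round(i * step)] for i in range(max_count)]
```

Since `step` is greater than 1 whenever sampling happens, the rounded indices are strictly increasing, so no image is picked twice.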
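Truncation strategy:
Stop accumulating once the total height would exceed 2000 px:
kept = []
current_height = 0
for img in images:
    if current_height + img.height > 2000:  # MAX_HEIGHT
        break
    kept.append(img)  # this image still fits
    current_height += img.height + 6  # image height plus GAP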
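To make the truncation and the gap handling concrete, the sketch below computes only the layout (the y offset of each kept image and the final canvas height) from a list of image heights, so it runs without an imaging library. The function name `plan_vertical_stack` is hypothetical; the 2000 px and 6 px values are taken from the description above.

```python
from typing import List, Sequence, Tuple

MAX_HEIGHT = 2000  # px, upper bound on accumulated height (from the description)
GAP = 6            # px, spacing between stacked images (from the description)


def plan_vertical_stack(
    heights: Sequence[int],
    max_height: int = MAX_HEIGHT,
    gap: int = GAP,
) -> Tuple[List[int], int]:
    """Return (y offsets of the kept images, canvas height in px)."""
    offsets: List[int] = []
    current = 0
    for h in heights:
        if current + h > max_height:
            break  # adding this image would exceed the limit
        offsets.append(current)
        current += h + gap
    # No gap is needed after the last kept image.
    canvas_height = current - gap if offsets else 0
    return offsets, canvas_height
```

Note that with this rule a single image taller than the limit yields an empty result, so a real implementation would presumably need a separate path for that case.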
- MinerU compatibility fix in manual.py (This fix is effective only in manual.py; other parsers remain unchanged.)
File: rag/app/manual.py Lines: 233–268
Issue: MinerU may generate tags using spaces as separators (e.g., @@3 127.0 430.0), while the original code only handled tab-delimited formats (\t). When parsing fails, the logic falls back to the default tuple:
(page, 0.0, 0.0, 0.0, 0.0)
This caused all chunk thumbnails to incorrectly display as if they came from page 1.
Fix: Added a MinerU-specific fallback parser. If the standard parser fails, the logic now attempts a more permissive \s+ pattern that matches both tabs and spaces.
Effect: Correctly preserves MinerU’s bbox coordinates, ensuring thumbnails display the right page and region.
Note: This adjustment is applied only to manual.py. Other custom parsers (e.g., tcadp, docling) still assume their original tag formats and may require similar handling if they encounter space-delimited tags.
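A two-stage parse of this kind might look like the sketch below. It is illustrative only and is not the code in rag/app/manual.py: the pattern names, the `parse_tag` helper, and the assumption that a tag holds one integer page number followed by four numeric fields and a closing `##` are all introduced here.

```python
import re
from typing import Optional, Tuple

NUM = r"(\d+(?:\.\d+)?)"

# Standard form: fields separated by exactly one tab.
STRICT_TAG = re.compile(r"@@(\d+)" + (r"\t" + NUM) * 4 + r"##")
# Permissive fallback: any run of whitespace (tabs or spaces) between fields.
LENIENT_TAG = re.compile(r"@@(\d+)" + (r"\s+" + NUM) * 4 + r"\s*##")


def parse_tag(tag: str) -> Optional[Tuple[int, float, float, float, float]]:
    """Return (page, four coordinates) or None when neither pattern matches."""
    match = STRICT_TAG.search(tag) or LENIENT_TAG.search(tag)
    if match is None:
        return None  # caller decides what default to use
    page = int(match.group(1))
    a, b, c, d = (float(g) for g in match.groups()[1:])
    return (page, a, b, c, d)
```

Returning `None` instead of a zero-filled tuple keeps a parse failure visible to the caller, which avoids silently mapping every chunk to the same default region.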
KevinHuSh @KevinHuSh
Hi, @shaoqing404
Do you have any screenshots to show your results? I can see there is incorrect position mapping for the pipeline backend. And this does not happen without your modification.
Confirming our version: mineru 2.6.6
The test results under the vlm-http-client mode based on mineru-api are as follows:
before:
Hi, @shaoqing404 Do you have any screenshots to show your results? I can see there is incorrect position mapping for the pipeline backend. And this does not happen without your modification.
At the moment, the issue fixed in this PR only applies to the manual mode. Let me take a look at the annotation drifting problem you mentioned.
And this does not happen without your modification.