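What does the --method ocr parameter do? In what scenarios is it needed? With this parameter, code snippets are recognized as a single line; without it, the original formatting is recognized correctly.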

Open freedom1993 opened this issue 1 year ago • 4 comments

Description of the bug

Original content: image

Parse result with the --method ocr parameter: image

Parse result without the --method ocr parameter: image

How to reproduce the bug

magic-pdf pdf-command --pdf agents.pdf --inside_model true --method ocr

Operating system

Windows

Python version

3.10

Software version (magic-pdf --version)

0.6.x

Device mode

cpu

freedom1993 avatar Aug 01 '24 11:08 freedom1993

--method ocr means PaddleOCR is used to extract text from the PDF; --method text means PyMuPDF is used to extract text from the PDF.

The difference is that the bounding boxes returned by PyMuPDF may expand irregularly in any direction and overlap the bounding boxes of neighboring text, which can lead to errors in position calculations. The text bounding boxes obtained through OCR are comparatively reliable.
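
To make the overlap problem concrete, here is a minimal sketch in plain Python. It is not MinerU's code: the box coordinates, the helper names `overlap_area` and `overlapping_pairs`, and the (x0, y0, x1, y1) layout are all assumptions made for illustration. It only shows how a single expanded box starts to intersect its neighbors, which is the kind of ambiguity that can upset later position calculations.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1), with y growing downward.
BBox = Tuple[float, float, float, float]


def overlap_area(a: BBox, b: BBox) -> float:
    """Return the intersection area of two boxes, or 0.0 if they are disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def overlapping_pairs(boxes: List[BBox]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, of boxes that intersect."""
    return [
        (i, j)
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
        if overlap_area(boxes[i], boxes[j]) > 0
    ]


# Made-up boxes for three consecutive text lines, each 10 units tall.
tight = [(0, 0, 100, 10), (0, 12, 100, 22), (0, 24, 100, 34)]

# Same lines, but the middle box has expanded 5 units up and 5 units down.
expanded = [(0, 0, 100, 10), (0, 7, 100, 27), (0, 24, 100, 34)]

print(overlapping_pairs(tight))     # no box touches another
print(overlapping_pairs(expanded))  # the middle box now intersects both neighbors
```

With tight boxes every line can be ordered unambiguously by its vertical position; once a box expands into its neighbors, a layout step that relies on box positions has to guess which text belongs where.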

drunkpig avatar Aug 01 '24 11:08 drunkpig

@freedom1993 We will record the behavior you reported as a bug and investigate the root cause.

drunkpig avatar Aug 01 '24 11:08 drunkpig

@freedom1993 Can you provide this PDF?

drunkpig avatar Aug 01 '24 11:08 drunkpig