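What does the --method ocr parameter do, and in what scenarios should it be added? With this parameter, code snippets are recognized as a single line; without it, the original formatting is recognized correctly.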
Description of the bug
Original content
Parsing result with the --method ocr parameter
Parsing result without the --method ocr parameter
How to reproduce the bug
magic-pdf pdf-command --pdf agents.pdf --inside_model true --method ocr
Operating system
Windows
Python version
3.10
Software version (magic-pdf --version)
0.6.x
Device mode
cpu
--method ocr means PaddleOCR is used to extract text from the PDF; --method text means PyMuPDF is used to extract text from the PDF.
The difference is that the bounding boxes obtained from PyMuPDF may expand irregularly in all directions and cover the bounding boxes of surrounding text, which can lead to errors in position calculations. The text bounding boxes obtained through OCR are relatively reliable.
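To make the bounding-box point concrete, here is a minimal sketch of how an expanded box can confuse a position calculation. It is not magic-pdf's actual layout code: the Box type, the overlap_area and on_same_row helpers, the 0.5 threshold, and all coordinates are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned text bounding box (hypothetical type, page coordinates)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not intersect."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(w, 0.0) * max(h, 0.0)


def on_same_row(a: Box, b: Box, min_ratio: float = 0.5) -> bool:
    """Illustrative heuristic: treat two boxes as one text row when their
    vertical overlap covers at least min_ratio of the shorter box's height."""
    v = min(a.y1, b.y1) - max(a.y0, b.y0)
    shorter = min(a.y1 - a.y0, b.y1 - b.y0)
    return shorter > 0 and v / shorter >= min_ratio


# Two consecutive lines with tight boxes (made-up coordinates).
tight_a = Box(50, 100, 300, 112)
tight_b = Box(50, 114, 300, 126)

# The same two lines with boxes that have expanded in all directions.
loose_a = Box(48, 96, 302, 122)
loose_b = Box(48, 106, 302, 132)

print(overlap_area(tight_a, tight_b), on_same_row(tight_a, tight_b))
print(overlap_area(loose_a, loose_b), on_same_row(loose_a, loose_b))
```

With the tight boxes the two lines do not intersect and stay separate; with the expanded boxes they overlap enough that this heuristic would place them on the same row, which is the kind of position error described above.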
@freedom1993 We will record the behavior you reported as a bug and investigate the root cause.
@freedom1993 Can you provide this PDF?