docling identified my entire page as a picture

Open aodingpeng opened this issue 1 year ago • 4 comments

Bug

I need to extract the content of this page, but Docling seems to have recognized the entire page as an image.

image

file: ISO IEC 23090-5DUP.pdf

Is there any way to solve this problem?

aodingpeng avatar Nov 18 '24 05:11 aodingpeng

This is a scanned document. You should use the OCR argument to parse it.

mllife avatar Nov 18 '24 07:11 mllife
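
For readers following along, here is a minimal sketch of what enabling OCR through Docling's Python API might look like. It assumes a Docling 2.x install with the EasyOCR backend available; the helper names `build_ocr_converter` and `pdf_to_markdown` are hypothetical, and `force_full_page_ocr` is an optional setting that may or may not help on a fully scanned page. As the maintainer notes further down the thread, OCR on its own may not fix a page that the layout model classifies as one large picture.

```python
def build_ocr_converter(force_full_page_ocr: bool = True):
    """Return a Docling DocumentConverter with OCR enabled for PDF input.

    Sketch only: assumes Docling 2.x with the EasyOCR backend installed.
    """
    # Imported lazily so this module still loads where docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Assumption: running OCR over the whole page is appropriate for a scan.
    pipeline_options.ocr_options = EasyOcrOptions(
        force_full_page_ocr=force_full_page_ocr
    )
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


def pdf_to_markdown(path: str) -> str:
    """Convert one PDF and return its Markdown export."""
    result = build_ocr_converter().convert(path)
    return result.document.export_to_markdown()
```

Calling `pdf_to_markdown("ISO IEC 23090-5DUP.pdf")` would then run the conversion on the file from this issue.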

> This is a scanned document. You should use the OCR argument to parse it.

I added OCR to my final command, but the layout analysis still treated the page as an image.

aodingpeng avatar Nov 18 '24 13:11 aodingpeng

> This is a scanned document. You should use the OCR argument to parse it.

I'm new to this. How do I run the source code?

aodingpeng avatar Nov 19 '24 04:11 aodingpeng

@aodingpeng I will investigate this issue. My suspicion is that the layout of this page is wrongly detected as a full-page picture, so all content inside the detected picture is lost (so far, Docling ignores in-picture text). OCR alone won't solve this.

cau-git avatar Nov 19 '24 08:11 cau-git
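
The suspicion above can be checked with a simple coverage test: if a detected picture region covers nearly the whole page, any text inside it would be dropped along with it. The sketch below is not Docling's code. The `Box` type, the `is_full_page_picture` helper, and the 0.9 threshold are all assumptions made for illustration, and the boxes would have to come from the layout output of whatever converter is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in page coordinates (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def area(self) -> float:
        # Degenerate or inverted boxes count as zero area.
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def is_full_page_picture(picture: Box, page: Box, threshold: float = 0.9) -> bool:
    """True if the picture covers at least `threshold` of the page area."""
    if page.area == 0:
        return False
    # Clip the picture to the page before measuring coverage.
    clipped = Box(
        max(picture.left, page.left),
        max(picture.top, page.top),
        min(picture.right, page.right),
        min(picture.bottom, page.bottom),
    )
    return clipped.area / page.area >= threshold


# Example: a US Letter page (612 x 792 pt) with one picture that nearly fills it.
page = Box(0, 0, 612, 792)
suspicious = is_full_page_picture(Box(10, 10, 600, 780), page)
```

A page flagged this way is a candidate for the failure described in this issue, where the text is present in the scan but missing from the output.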

@aodingpeng The current version of docling (2.17.0) handles your sample case better now. It is certainly not perfect, but you will get some meaningful text out. I will close this issue until further notice.

cau-git avatar Jan 30 '25 14:01 cau-git
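
To confirm whether a local install is at least the version mentioned above, a small standard-library check along these lines may help. The helper names are hypothetical, and the version parsing is deliberately simple: it compares only the leading numeric components, so pre-release suffixes are not handled precisely.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn '2.17.0' into (2, 17, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def docling_is_at_least(minimum: str = "2.17.0"):
    """Return True or False, or None if docling is not installed."""
    try:
        installed = version("docling")
    except PackageNotFoundError:
        return None
    return parse_version(installed) >= parse_version(minimum)
```

If `docling_is_at_least()` returns `False`, upgrading the package before re-testing the sample file is the natural next step.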