Landscape pages are not read

Open mohamed99akram opened this issue 11 months ago • 6 comments

Bug

I have a 39-page document: 29 pages are portrait and 10 are landscape. The text itself is normal (upright, not rotated); only the page orientation differs. Docling doesn't read the landscape pages. All pages contain tables, and on the landscape pages the tables are not read correctly either. On the portrait pages, however, tables are read fine.

Steps to reproduce

Take a PDF file that mixes page orientations (at least one portrait page and one landscape page), then convert the PDF to Markdown.
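
A minimal sketch of these steps, assuming Docling's documented `DocumentConverter` entry point. The helper `landscape_pages` and the file name `mixed.pdf` are hypothetical additions for illustration, not part of the original report; the helper only flags which pages are wider than tall so their output can be checked first.

```python
from typing import List, Sequence, Tuple


def convert_to_markdown(pdf_path: str) -> str:
    """Convert a PDF to Markdown with Docling's default pipeline."""
    # Imported lazily so the helper below can be used without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    return result.document.export_to_markdown()


def landscape_pages(page_sizes: Sequence[Tuple[float, float]]) -> List[int]:
    """Return 1-based numbers of the pages whose width exceeds their height.

    `page_sizes` holds (width, height) pairs as displayed, obtained from
    any PDF reader of your choice (hypothetical input, not a Docling API).
    """
    return [
        number
        for number, (width, height) in enumerate(page_sizes, start=1)
        if width > height
    ]
```

Calling `convert_to_markdown("mixed.pdf")` and comparing the output against the pages reported by `landscape_pages` shows whether the missing text lines up with the landscape pages.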

Docling version

2.8.3

Python version

3.10.14

mohamed99akram avatar Jan 06 '25 12:01 mohamed99akram

@mohamed99akram could you please provide a sample document to reproduce the issue?

nikos-livathinos avatar Jan 06 '25 13:01 nikos-livathinos

Hi Nikos, I have the same problem. Here is an example of a landscape PDF. Some pages work fine, others do not work at all. Marketing.pdf

JeandeBalzac avatar Jan 06 '25 15:01 JeandeBalzac

After checking closer, @JeandeBalzac your issue does not appear to be connected to portrait layout. It is simply because there are many elements identified as figures, and these will export as bitmap resources in the markdown / HTML. The contained text elements of figures are in the JSON representation of the DoclingDocument but not exported to the other formats by default.
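
A rough sketch of how the text nested under figures could be pulled out of the JSON export. It assumes the export has top-level `texts` and `pictures` arrays, and that each picture lists its children as `{"$ref": "#/texts/<index>"}` pointers; check this layout against the JSON your Docling version actually produces. The function name is hypothetical.

```python
from typing import Any, Dict, List

TEXT_REF_PREFIX = "#/texts/"


def texts_inside_pictures(doc: Dict[str, Any]) -> List[str]:
    """Collect the text of every text item that is a direct child of a picture.

    `doc` is the DoclingDocument JSON loaded into a dict. The key names
    used here are assumptions about that export, as noted above.
    """
    texts = doc.get("texts", [])
    found: List[str] = []
    for picture in doc.get("pictures", []):
        for child in picture.get("children", []):
            ref = child.get("$ref", "")
            # Only follow pointers into the `texts` array; skip other children.
            if ref.startswith(TEXT_REF_PREFIX):
                index = int(ref[len(TEXT_REF_PREFIX):])
                found.append(texts[index]["text"])
    return found
```

This only follows direct children, so text nested more deeply (for example inside a group within a figure) would need a recursive walk.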

cau-git avatar Jan 13 '25 18:01 cau-git

Hi. Yes, we are aware that the pages are included as images. However, our goal is to extract text, not images, so this is still a bug for us. I can also provide another landscape PDF, which is messed up quite a bit. We analyzed the problem: on landscape pages x and y are swapped, and x behaves differently. X increases from right to left, not from left to right as usual in portrait. Moreover, the top-left point is no longer top-left, and the same is true for the bottom-right point.
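
To make the coordinate problem concrete, here is an illustrative sketch (not Docling's code) of the kind of normalization that would undo it. It assumes a top-left origin with y increasing downward, and an x axis that runs right to left on the affected pages; `Box`, `normalize` and `mirror_x` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def normalize(x0: float, y0: float, x1: float, y1: float) -> Box:
    """Order two opposite corners so (left, top) really is the top-left one."""
    return Box(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def mirror_x(box: Box, page_width: float) -> Box:
    """Flip a box horizontally, for a page whose x axis runs right to left."""
    return Box(
        page_width - box.right,
        box.top,
        page_width - box.left,
        box.bottom,
    )
```

For example, `normalize(300, 200, 100, 50)` reorders the corners into `Box(100, 50, 300, 200)`, and `mirror_x` then maps that box to its left-to-right position on a page of the given width.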

JeandeBalzac avatar Jan 18 '25 20:01 JeandeBalzac

@JeandeBalzac if you have more affected PDFs, please attach them here. We need to analyze this problem more broadly.

cau-git avatar Jan 29 '25 13:01 cau-git

@cau-git Any update on this issue? Most of the PDFs in my dataset are portrait, but some are rotated to landscape. I wish Docling could detect that, rotate the pages, and then extract.

benzhang-se avatar Feb 07 '25 23:02 benzhang-se

@benzhang-se The core problem is representing and extracting picture contents. We are actively working on creating datasets and models for this purpose. Once they are available, this will be announced in the release notes. I will close this issue for now.

cau-git avatar May 21 '25 10:05 cau-git