Landscape pages are not read
Bug
I have a 39-page document: 29 pages are in portrait orientation and 10 are in landscape. The text itself is normal (vertical, not rotated); only the page orientation differs. Docling doesn't read the landscape pages. All pages contain tables, and on the landscape pages the tables are not read correctly either. On the portrait pages, however, tables are read fine.
Steps to reproduce
Take a PDF file that mixes page orientations, with some pages in portrait and some in landscape, then convert the PDF to Markdown.
Docling version
2.8.3
Python version
3.10.14
@mohamed99akram could you please provide a sample document to reproduce the issue?
Hi Nikos, I have the same problem. Here is an example of a landscape PDF. Some pages work fine, others do not work at all. Marketing.pdf
After checking more closely, @JeandeBalzac, your issue does not appear to be connected to the landscape layout. It happens simply because many elements are identified as figures, and these are exported as bitmap resources in the Markdown / HTML. The text elements contained in figures are present in the JSON representation of the DoclingDocument but are not exported to the other formats by default.
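As a rough illustration of the point about the JSON representation, the sketch below collects text nested under picture items from a parsed JSON export. It assumes a layout in which the top-level `texts` and `pictures` lists hold items whose `children` are references of the form `{"$ref": "#/texts/0"}`; that layout, the function name `texts_inside_pictures`, and the sample data are assumptions made for illustration and should be checked against an actual export before relying on them.

```python
def texts_inside_pictures(doc):
    """Collect text found under picture items in a parsed JSON export.

    `doc` is a plain dict. The key names used here ("pictures", "texts",
    "children", "$ref") are an ASSUMED layout, not a confirmed schema.
    """
    def resolve(ref):
        # A reference such as "#/texts/12" points at doc["texts"][12].
        _, kind, index = ref.split("/")
        return doc[kind][int(index)]

    def collect(item, out):
        # Depth-first walk over the children of one item.
        for child in item.get("children", []):
            node = resolve(child["$ref"])
            text = node.get("text")
            if text:
                out.append(text)
            collect(node, out)

    found = []
    for picture in doc.get("pictures", []):
        collect(picture, found)
    return found


# Hypothetical sample data that mimics the assumed layout.
sample = {
    "texts": [
        {"text": "Q1 revenue", "children": []},
        {"text": "Body paragraph", "children": []},
    ],
    "pictures": [{"children": [{"$ref": "#/texts/0"}]}],
}
figure_texts = texts_inside_pictures(sample)
```

Only text reachable from a picture is returned, so ordinary body text is left out.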
Hi. Yes, we are aware that the pages are included as images. However, our goal is to extract text, not images, so this is still a bug for us. I can also provide another landscape PDF, which is messed up quite a bit. We analyzed the problem: on landscape pages x and y are changed, and x works differently. It increases from right to left, not from left to right as usual in portrait. Moreover, the top-left point is no longer the top-left, and the same is true for the bottom-right point.
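To make the coordinate observation concrete, here is a minimal sketch of how a bounding box reported with a mirrored x axis could be brought back to the usual convention (origin at the top-left, x increasing left to right). The `Box` type, the `normalize_box` helper, and the `x_right_to_left` flag are hypothetical names, not part of Docling, and whether a simple mirror is the right correction for the affected PDFs has not been verified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float


def normalize_box(box, page_width, x_right_to_left=False):
    """Return a box with left <= right and top <= bottom.

    If x was measured from the right edge (the behaviour described for
    landscape pages), mirror it first so it is measured from the left.
    Assumes y is already measured from the top of the page.
    """
    left, right = box.left, box.right
    if x_right_to_left:
        left, right = page_width - left, page_width - right
    x0, x1 = sorted((left, right))
    y0, y1 = sorted((box.top, box.bottom))
    return Box(x0, y0, x1, y1)


# Hypothetical box whose "top-left" corner is not actually the top-left.
raw = Box(left=700, top=300, right=500, bottom=100)
fixed = normalize_box(raw, page_width=800, x_right_to_left=True)
```

Sorting the corners after mirroring guarantees that the first point of the result is the true top-left, whatever order the input corners arrived in.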
@JeandeBalzac if you have more affected PDFs please attach them here, we need to analyze this problem more broadly.
@cau-git Any update on this issue? My dataset is mostly vertical PDFs, but some are rotated to landscape. I wish Docling could detect that, rotate the pages, and then extract.
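The detection step wished for here could be sketched roughly as follows, assuming each page's width, height, and rotation (in PDF, the page `/Rotate` value, a multiple of 90 degrees) have already been read with some PDF library. The function name `effective_orientation` is hypothetical, and this classifies the page only; it says nothing about whether the text on the page is itself rotated.

```python
def effective_orientation(width, height, rotation=0):
    """Classify a page as "portrait" or "landscape" as it is displayed.

    `rotation` is the page rotation in degrees and must be a multiple
    of 90. A quarter or three-quarter turn swaps width and height.
    """
    if rotation % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    if (rotation // 90) % 2 == 1:
        width, height = height, width
    return "landscape" if width > height else "portrait"


# Hypothetical page sizes in points (US Letter is 612 x 792).
pages = [(612, 792, 0), (612, 792, 90), (792, 612, 0), (792, 612, 270)]
orientations = [effective_orientation(w, h, r) for w, h, r in pages]
```

Pages classified as landscape could then be routed to a rotation or special-handling step before extraction.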
@benzhang-se The core problem is representing and extracting picture contents. We are actively working on creating datasets and models for this purpose. Once they are available, this will be announced in the release notes. I will close this issue for now.