Christoph Auer
Note: The current version of docling (2.17.0) has the header ordering sorted, but we are continuing to work on proper key-value placement. Output:
Checking the attached PDF, it is no surprise that we see a very long conversion time. It is fully scanned and has a lot of pages, which is very slow on...
Hi @Manuel030, yes, this is indeed a problem with MS Office formats we are aware of. Let us have an iteration on this topic to see if we can find...
@pankpy Could you please provide an example to illustrate the behaviour? Thanks.
@aodingpeng I will investigate this issue. My suspicion is that the layout of this page is wrongly detected as a full page picture, hence all content in the detected picture...
@aodingpeng The current version of docling (2.17.0) treats your sample case better now. Certainly not perfect but you will get some meaningful text out. I will close this issue until...
@jerbob92 @BelaidCH We have a version in the works that will make it possible to get the in-picture content out; it will be released by the end of next week. I will post...
This has since been implemented and is ready to use.
@Raphilanthrope This is obsolete since docling 2.13.0 because the layout_utils code is entirely replaced.
@Manuel030 @maxmnemonic There is apparently a newer PR with the same goal here: https://github.com/docling-project/docling/pull/1610 which has the proper condition to not produce empty text paragraphs.