fix: find paragraphs in elements with images in docx

Open Manuel030 opened this issue 7 months ago • 4 comments

Some text is not found when using the MsWordDocumentBackend. An example docx file where this happens is attached: paragraph_in_image.docx

The pragmatic solution is to attempt to add text elements even when a drawing expression is found.
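To illustrate the idea, here is a minimal sketch. It is not the actual `MsWordDocumentBackend` code: it uses only the standard library on a hand-written WordprocessingML fragment, and the names `extract_items` and `SAMPLE` are hypothetical. It assumes the missing text lives in `w:t` runs of a paragraph that also contains a `w:drawing`.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by docx body XML.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical, hand-written fragment: a plain paragraph, a paragraph with
# both a drawing and text, and a paragraph that only holds a drawing.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>Plain paragraph</w:t></w:r></w:p>
  <w:p>
    <w:r><w:drawing/></w:r>
    <w:r><w:t>Caption next to the image</w:t></w:r>
  </w:p>
  <w:p><w:r><w:drawing/></w:r></w:p>
</w:body>
"""


def extract_items(body_xml):
    """Return (kind, text) tuples for each top-level paragraph."""
    items = []
    root = ET.fromstring(body_xml.strip())
    for paragraph in root.findall("w:p", NS):
        has_drawing = paragraph.find(".//w:drawing", NS) is not None
        text = "".join(t.text or "" for t in paragraph.iterfind(".//w:t", NS)).strip()
        if has_drawing:
            items.append(("picture", ""))
        # Do not stop at the drawing: still try to add the text,
        # but only when there is something to add.
        if text:
            items.append(("text", text))
    return items


if __name__ == "__main__":
    for kind, text in extract_items(SAMPLE):
        print(kind, repr(text))
```

The `if text` guard keeps a drawing-only paragraph from producing an empty text item.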

Checklist:

  • [ ] Documentation has been updated, if necessary.
  • [ ] Examples have been added, if necessary.
  • [ ] Tests have been added, if necessary.

Manuel030 avatar Apr 28 '25 11:04 Manuel030

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewers for test updates

This rule is failing.

When test data is updated, we require two reviewers.

  • [ ] #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
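For anyone who wants to check a title locally before pushing, a rough sketch with Python's `re` module follows. The pattern is copied from the rule above; the helper name `is_conventional` is hypothetical, and using `re.search` to mimic the `~=` operator is an assumption about how the bot evaluates it.

```python
import re

# Pattern copied verbatim from the merge protection rule.
TITLE_PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title):
    """Return True when the title starts with a conventional-commit prefix."""
    return TITLE_PATTERN.search(title) is not None


if __name__ == "__main__":
    for title in (
        "fix: find paragraphs in elements with images in docx",
        "feat(docx)!: example with scope and breaking-change marker",
        "Find paragraphs in elements with images",
    ):
        print(is_conventional(title), title)
```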

mergify[bot] avatar Apr 28 '25 11:04 mergify[bot]

@Manuel030 Thank you for the PR! Could you add this document as a test?

PeterStaar-IBM avatar Apr 29 '25 04:04 PeterStaar-IBM

@PeterStaar-IBM Sure

Manuel030 avatar Apr 29 '25 14:04 Manuel030

@Manuel030 @maxmnemonic There is apparently a newer PR with the same goal: https://github.com/docling-project/docling/pull/1610. It has the proper condition to avoid producing empty text paragraphs.

cau-git avatar May 23 '25 11:05 cau-git

Closing this, as it is superseded by https://github.com/docling-project/docling/pull/1610

PeterStaar-IBM avatar Aug 25 '25 08:08 PeterStaar-IBM