fix: pptx line break and space handling

Open mawi12345 opened this issue 6 months ago • 2 comments

This PR improves the handling of LineBreaks and spaces in the pptx backend. Before this PR, LineBreaks were ignored; they are now replaced by a space. Empty TextRuns were also ignored, which caused two words to be joined together when the only thing between them was a single empty TextRun.

The test file powerpoint_bad_text.pptx currently outputs:

# X-LibraryThe fully customisable and copyright-freestandardcontenttemplatecollectionexclusivelyforourcustomers

With the PR:

# X-Library The fully customisable and copyright-free standard content template collection exclusively for our customers
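
The difference between the two outputs can be illustrated with a minimal sketch. It does not use the actual docling or python-pptx code: the paragraph is modelled as a hypothetical list of `("run", text)` and `("br", "")` items, the function names `join_old` and `join_new` are made up for illustration, and it assumes that an "empty" TextRun means one containing only whitespace.

```python
import re
from typing import List, Tuple

# Hypothetical, simplified paragraph model (not the python-pptx API):
# each item is ("run", text) for a TextRun or ("br", "") for a LineBreak.
Item = Tuple[str, str]


def join_old(items: List[Item]) -> str:
    """Behaviour described as pre-PR: LineBreaks and blank runs are dropped."""
    return "".join(text for kind, text in items if kind == "run" and text.strip())


def join_new(items: List[Item]) -> str:
    """Behaviour described for the PR: a LineBreak becomes a space and
    blank runs are kept, so word boundaries survive."""
    parts = []
    for kind, text in items:
        if kind == "br":
            parts.append(" ")   # LineBreak is replaced by a space
        elif kind == "run":
            parts.append(text)  # keep every run, including whitespace-only ones
    # Collapsing repeated whitespace is an assumption made for tidy output.
    return re.sub(r"\s+", " ", "".join(parts)).strip()


items: List[Item] = [
    ("run", "X-Library"),
    ("br", ""),
    ("run", "The fully"),
    ("run", " "),
    ("run", "customisable"),
]

print(join_old(items))
print(join_new(items))
```

With this input, `join_old` fuses the words on either side of the LineBreak and of the blank run, while `join_new` keeps them separated by a single space.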

Checklist:

  • [ ] Documentation has been updated, if necessary.
  • [ ] Examples have been added, if necessary.
  • [x] Tests have been added, if necessary.

mawi12345, May 27 '25 09:05

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
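
The title rule above can be checked locally with a minimal sketch. It assumes that the `~=` operator behaves like a regular-expression search, approximated here with Python's `re.search`; the helper name `is_conventional` is hypothetical.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("fix: pptx line break and space handling"))
print(is_conventional("Improve pptx line break handling"))
```

The first title is the one used by this pull request; the second is a made-up counterexample with no type prefix.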

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • [X] #approved-reviews-by >= 2

mergify[bot], May 27 '25 09:05

Codecov Report

Attention: Patch coverage is 96.42857% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/mspowerpoint_backend.py | 96.42% | 1 Missing ⚠️ |

codecov[bot], May 27 '25 18:05