docling
fix: pptx line break and space handling
This PR improves the handling of LineBreaks and spaces in the pptx backend. Before the PR, LineBreaks were ignored; they are now replaced by a space. Empty TextRuns were also ignored, which caused adjacent words to run together when a single empty TextRun was the only thing separating them.
The test file powerpoint_bad_text.pptx currently outputs:
# X-LibraryThe fully customisable and copyright-freestandardcontenttemplatecollectionexclusivelyforourcustomers
With the PR:
# X-Library The fully customisable and copyright-free standard content template collection exclusively for our customers
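For readers unfamiliar with the underlying markup, here is a minimal sketch of the idea, not the code in this PR. It parses a DrawingML paragraph (`<a:p>`) with the standard library and treats both `<a:br>` elements and empty or whitespace-only runs as a single space. The function name `paragraph_text`, the `SAMPLE` snippet, and the whitespace-collapsing step are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# DrawingML main namespace used by text elements inside a .pptx slide.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def paragraph_text(p_xml: str) -> str:
    """Hypothetical helper: flatten one <a:p> element into plain text.

    Line breaks (<a:br>) and empty or whitespace-only runs (<a:r>) are
    emitted as a space instead of being dropped, so neighbouring words
    do not get glued together.
    """
    paragraph = ET.fromstring(p_xml)
    parts = []
    for child in paragraph:
        if child.tag == f"{{{A_NS}}}br":
            parts.append(" ")
        elif child.tag == f"{{{A_NS}}}r":
            t = child.find(f"{{{A_NS}}}t")
            text = (t.text or "") if t is not None else ""
            # An empty run still separates its neighbours.
            parts.append(text if text.strip() else " ")
    # Assumption: collapse repeated whitespace into single spaces.
    return " ".join("".join(parts).split())


# Illustrative input, not taken from powerpoint_bad_text.pptx.
SAMPLE = (
    f'<a:p xmlns:a="{A_NS}">'
    "<a:r><a:t>X-Library</a:t></a:r><a:br/>"
    "<a:r><a:t>The fully</a:t></a:r>"
    "<a:r><a:t> </a:t></a:r>"
    "<a:r><a:t>customisable</a:t></a:r>"
    "</a:p>"
)

print(paragraph_text(SAMPLE))
```

Dropping the `<a:br>` and the whitespace-only run instead would yield `X-LibraryThe fullycustomisable`, which is the kind of output shown for the test file before this PR.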
Checklist:
- [ ] Documentation has been updated, if necessary.
- [ ] Examples have been added, if necessary.
- [x] Tests have been added, if necessary.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X] `title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:`
🟢 Require two reviewer for test updates
Wonderful, this rule succeeded.
When test data is updated, we require two reviewers
- [X] `#approved-reviews-by >= 2`
Codecov Report
Attention: Patch coverage is 96.42857% with 1 line in your changes missing coverage. Please review.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/mspowerpoint_backend.py | 96.42% | 1 Missing :warning: |