docling
Support pagination in MSWord documents
Requested feature
It's more or less possible to get pagination out of a DOCX file created by some versions of MS Word (notably not Word 365) by looking at the <w:lastRenderedPageBreak/> elements. See https://ooxml.info/docs/17/17.3/17.3.3/17.3.3.13/
This is only partially supported by python-docx but we can just get it with XPath. I read CONTRIBUTING.md and I'm not supposed to do this, but I need the feature, so I made a PR anyway 😉 https://github.com/DS4SD/docling/pull/832
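For illustration, here is a rough sketch of the idea using only the Python standard library. It is not the code from the PR; the function names `read_document_xml` and `approximate_pages` are made up for this example, and it assumes the main document part lives at `word/document.xml`, which is the usual location but not guaranteed. It walks the XML in document order, bumps a page counter on every `<w:lastRenderedPageBreak/>`, and tags each non-empty paragraph with the page on which its first text run appears.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P_TAG = f"{{{W_NS}}}p"
T_TAG = f"{{{W_NS}}}t"
BREAK_TAG = f"{{{W_NS}}}lastRenderedPageBreak"


def read_document_xml(docx_path):
    """Read the main document part (assumed to be word/document.xml)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def approximate_pages(document_xml):
    """Return (page, text) for each non-empty paragraph.

    Page numbers are only as good as the markers Word left behind the
    last time it rendered the file; with no markers, everything is page 1.
    """
    root = ET.fromstring(document_xml)
    page = 1
    paragraphs = []  # each entry is [page_of_first_text_or_None, text]
    for element in root.iter():  # iter() yields elements in document order
        if element.tag == P_TAG:
            paragraphs.append([None, ""])
        elif element.tag == BREAK_TAG:
            page += 1
        elif element.tag == T_TAG and paragraphs:
            current = paragraphs[-1]
            if current[0] is None:
                current[0] = page
            current[1] += element.text or ""
    # Drop paragraphs without text, since they carry no reliable page
    return [(p, text) for p, text in paragraphs if p is not None]


SAMPLE = f"""<w:document xmlns:w="{W_NS}"><w:body>
<w:p><w:r><w:t>First</w:t></w:r></w:p>
<w:p><w:r><w:lastRenderedPageBreak/><w:t>Second</w:t></w:r></w:p>
<w:p><w:r><w:t>Third </w:t></w:r>
<w:r><w:lastRenderedPageBreak/><w:t>spills over</w:t></w:r></w:p>
</w:body></w:document>"""

if __name__ == "__main__":
    for page, text in approximate_pages(SAMPLE):
        print(page, text)
```

A paragraph that straddles a break is reported on the page where it starts, so "Third spills over" lands on page 2 in the sample even though it ends on page 3. Paragraphs nested inside text boxes are not handled specially here.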
Alternatives
There is no alternative! No, not true - pagination is always approximate for DOCX since it isn't (exactly) a presentation format. So, if you want to really know the page number, then render to a PDF first. Now you have two problems!
Note that it is generally impossible to get accurate pagination out of OOXML (docx), so you may prefer not to do this! But even very approximate page numbers can still be useful.
Do you mean our CONTRIBUTING.md? We are very happy to have the community building these extensions. Thanks a lot for the contribution.
Ah, just because CONTRIBUTING.md mentions that you should start a discussion before making a PR :)
There are a couple of things in the PR that may need to be improved!
I believe the PR should be complete now but obviously needs review...
I am closing this issue, since we decided that "approximate" pagination in Word is not feasible to include.