Support pagination in MSWord documents

Open dhdaines opened this issue 9 months ago • 4 comments

Requested feature

It's more or less possible to get pagination out of a DOCX file created by some versions of MSWord (notably not Word 365) by looking at the <w:lastRenderedPageBreak/> elements, which Word writes at the positions where a page break fell the last time it laid out the document. See https://ooxml.info/docs/17/17.3/17.3.3/17.3.3.13/

This is only partially supported by python-docx but we can just get it with XPath. I read CONTRIBUTING.md and I'm not supposed to do this, but I need the feature, so I made a PR anyway 😉 https://github.com/DS4SD/docling/pull/832
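For anyone who wants to try the idea outside of docling, here is a minimal sketch of the approach, not the code from the PR. It uses only the standard library (zipfile and ElementTree's limited XPath) rather than python-docx. The function names `approximate_pages` and `approximate_pages_from_docx` are made up for illustration, and it assumes that counting <w:lastRenderedPageBreak/> elements in document order is a good enough proxy for the page number. It only looks at top-level paragraphs in the body (not tables) and ignores explicit page breaks.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def approximate_pages(document_xml):
    """Return (approximate_page, text) for each top-level body paragraph.

    The page counter starts at 1 and is bumped every time a
    w:lastRenderedPageBreak is seen in document order. A paragraph is
    assigned the page on which its first text run starts.
    """
    root = ET.fromstring(document_xml)
    page = 1
    result = []
    for para in root.iterfind(".//w:body/w:p", NS):
        start_page = None
        chunks = []
        # iter() walks the paragraph and its descendants in document order
        for el in para.iter():
            if el.tag == f"{{{W}}}lastRenderedPageBreak":
                page += 1
            elif el.tag == f"{{{W}}}t":
                if start_page is None:
                    start_page = page
                chunks.append(el.text or "")
        # Paragraphs with no text get the page current at that point
        result.append((start_page if start_page is not None else page,
                       "".join(chunks)))
    return result


def approximate_pages_from_docx(path):
    """Hypothetical convenience wrapper: read document.xml from a .docx."""
    with zipfile.ZipFile(path) as zf:
        return approximate_pages(zf.read("word/document.xml"))
```

If the file was saved by something that doesn't write these elements, every paragraph simply ends up on page 1, so treat the numbers as a hint and nothing more.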

Alternatives

There is no alternative! No, not true - pagination is always approximate for DOCX since it isn't (exactly) a presentation format. So, if you want to really know the page number, then render to a PDF first. Now you have two problems!

dhdaines avatar Jan 29 '25 13:01 dhdaines

Note that it is generally impossible to get accurate pagination out of OOXML (docx), so for this reason you may prefer not to do this! But even very approximate page numbers can still be useful.

dhdaines avatar Jan 29 '25 14:01 dhdaines

Do you mean our CONTRIBUTING.md? We are very happy having the community building up these extensions. Thanks a lot for the contribution.

dolfim-ibm avatar Jan 29 '25 14:01 dolfim-ibm

> Do you mean our CONTRIBUTING.md? We are very happy having the community building up these extensions. Thanks a lot for the contribution.

Ah, just because CONTRIBUTING.md mentions that you should start a discussion before making a PR :)

There are a couple of things in the PR that may need to be improved!

dhdaines avatar Jan 29 '25 16:01 dhdaines

I believe the PR should be complete now but obviously needs review...

dhdaines avatar Feb 07 '25 16:02 dhdaines

I am closing this issue, since we decided that "approximate" pagination in Word is not feasible to include.

cau-git avatar May 20 '25 18:05 cau-git