python-docx2txt

Page numbers for each page

Open Higgs32584 opened this issue 1 year ago • 3 comments

Hi, there is no simple way to tell which page some text is from. Could we add functionality to divide the text by page? Thank you.

Higgs32584 avatar Mar 04 '23 02:03 Higgs32584

I, too, am wondering how to figure out what page things are on...

Is there a way to split docs on page breaks? Or a way to somehow add something to the metadata that includes the pages?

aronweiler avatar Jun 21 '23 15:06 aronweiler

Not possible without rebuilding the Word layout engine.

ShayHill avatar Jun 21 '23 16:06 ShayHill

> Not possible without rebuilding the Word layout engine.

Yeah, I started poking through the specifications and format for the XML underlying the Word documents, and it seems like page numbers are a completely client-side thing.

There are possible workarounds, like using the last rendered page break, but that relies on someone opening the Word document (in Word), scrolling through it, and then re-saving it, because Word itself adds the <w:lastRenderedPageBreak/> XML fragment when it renders the document.

Obviously that's a bad idea to rely on for several reasons, and one appears to be that it doesn't even put the XML fragments in the right place all of the time.
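
For anyone who wants to try it anyway, here is a rough sketch of that workaround using only the Python standard library. It is not part of docx2txt; `split_document_xml` and `split_docx_into_pages` are hypothetical names I made up for illustration. It splits on <w:lastRenderedPageBreak/> and on explicit <w:br w:type="page"/> breaks, and it assumes that a rendered-break marker directly following an explicit break (with no text in between) refers to the same page boundary. The resulting page numbers are only as good as the markers Word last saved, so treat them as approximate.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in the form ElementTree uses for tags/attributes.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_document_xml(xml_bytes):
    """Split the text of a word/document.xml payload into approximate pages.

    Hypothetical helper: relies on break markers already present in the XML.
    """
    root = ET.fromstring(xml_bytes)
    pages, current = [], []
    just_broke = False  # True right after an explicit page break

    def flush():
        pages.append("".join(current).strip())
        current.clear()

    # iter() walks elements in document order (parents before children).
    for el in root.iter():
        if el.tag == W + "br" and el.get(W + "type") == "page":
            flush()
            just_broke = True
        elif el.tag == W + "lastRenderedPageBreak":
            if just_broke:
                # Assumption: same boundary as the explicit break just seen.
                just_broke = False
            else:
                flush()
        elif el.tag == W + "t":
            current.append(el.text or "")
            just_broke = False
        elif el.tag == W + "p" and current:
            # New paragraph on the same page: separate with a newline.
            current.append("\n")
    flush()
    return pages


def split_docx_into_pages(path):
    """Read word/document.xml from a .docx file and split it into pages."""
    with zipfile.ZipFile(path) as zf:
        return split_document_xml(zf.read("word/document.xml"))
```

Calling `split_docx_into_pages("report.docx")` (the file name is just an example) returns a list where index 0 is the text of the first page as Word last rendered it. If the file was never opened and saved in Word, only explicit page breaks will be found.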

For some context around what I wanted page numbers for: I am using docx2txt to load Word documents into a vector store to be consumed by an LLM, and I wanted to print out the source locations inside the documents where I found data.

aronweiler avatar Jun 21 '23 17:06 aronweiler