python-docx2txt
Page numbers for each page
Hi, there is currently no simple way to tell which page a piece of text comes from. Could we add functionality to divide the text by page number? Thank you.
I too am wondering how to figure out what page things are on...
Is there a way to split docs on page breaks? Or a way to somehow add something to the metadata that includes the pages?
Not possible without rebuilding the Word layout engine.
Yeah, I started poking through the specifications and format for the XML underlying the Word documents, and it seems like page numbers are a completely client-side thing.
There are possible workarounds, like using the last rendered page break, but that relies on someone opening the document in Word, scrolling through it, and re-saving it, because Word itself adds the <w:lastRenderedPageBreak/> XML fragment when it renders the document.
Obviously that's a bad idea to rely on for several reasons, and one appears to be that it doesn't even put the XML fragments in the right place all of the time.
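For anyone who wants to try the workaround anyway, here is a rough sketch of what I mean, using only the standard library. It is not part of docx2txt; the names split_on_page_hints and pages_from_docx are made up for illustration. It assumes the document was last saved by Word (so the rendered-break markers exist at all) and it also treats explicit page breaks as boundaries. The result is only as accurate as those markers: a missing or misplaced marker shifts every later page, and an explicit break that is followed by a rendered-break marker for the same boundary would be counted twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_on_page_hints(document_xml):
    """Split the text of word/document.xml into approximate pages.

    Hypothetical helper: a new page starts at every
    <w:lastRenderedPageBreak/> and at every <w:br w:type="page"/>.
    """
    root = ET.fromstring(document_xml)
    pages = [[]]
    # iter() walks the tree in document order, so text and break
    # markers are seen in the order they appear in the document.
    for el in root.iter():
        if el.tag == W + "p":
            pages[-1].append("\n")      # separate paragraphs
        elif el.tag == W + "t" and el.text:
            pages[-1].append(el.text)   # run text
        elif el.tag == W + "lastRenderedPageBreak":
            pages.append([])            # page break cached by Word
        elif el.tag == W + "br" and el.get(W + "type") == "page":
            pages.append([])            # explicit page break
    return ["".join(parts).strip() for parts in pages]


def pages_from_docx(path):
    """Read word/document.xml from a .docx file and split it."""
    with zipfile.ZipFile(path) as zf:
        return split_on_page_hints(zf.read("word/document.xml"))
```

Index i in the returned list would correspond to page i + 1, which is about as close to a "source location" as this approach gets. Headers, footers, footnotes, and text boxes are not handled separately here.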
For some context on why I wanted page numbers: I am using docx2txt to load Word documents into a vector store to be consumed by an LLM, and I wanted to print the source locations inside the documents where I found the data.