Pagewise Markdown output
Hi guys,
Love the work. In our current approach we convert each document to a list of Markdown strings, one per page. This is really useful in RAG for providing citations and enriching the chunks during completion.
I'm unsure how feasible this is for PDF, docx and other "paginated" content.
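To illustrate why the per-page list helps with citations, here is a minimal sketch, assuming the pages are already available as a list of Markdown strings. The `Chunk` class, the `chunk_pages` helper, and the fixed-size character chunking are all hypothetical and not part of markitdown; a real pipeline would likely use a smarter splitter.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int  # 1-based page number, usable as a citation later


def chunk_pages(pages: list[str], max_chars: int = 500) -> list[Chunk]:
    """Cut each page into fixed-size chunks that remember their page number.

    Hypothetical helper: chunks never span two pages, so each one can be
    cited with a single page number.
    """
    chunks: list[Chunk] = []
    for page_number, page_md in enumerate(pages, start=1):
        for start in range(0, len(page_md), max_chars):
            chunks.append(Chunk(text=page_md[start:start + max_chars], page=page_number))
    return chunks
```

Each retrieved chunk then carries the page it came from, so the completion step can cite it without any extra lookup.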
The .pptx extractor includes <!-- Slide number: X --> comments, which I use to split the markdown by page. For Excel, each sheet gets a ## Sheet name heading.
I agree that it would be nice to have paged output for the other formats (pdf and docx in my case), along with an API that returns the output page by page so I don't need to split it myself based on markers.
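For anyone who wants to do the marker-based splitting in the meantime, a rough sketch follows. It assumes the markers look exactly as described above (`<!-- Slide number: X -->` for slides, `## Sheet name` headings for sheets); the function names are made up for illustration and are not part of markitdown, and the sheet splitter would misfire if a cell or note happened to contain a line starting with `## `.

```python
import re

# Marker formats as described in this thread; adjust if the output differs.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")
SHEET_HEADING = re.compile(r"^## (.+)$", re.MULTILINE)


def split_by_slide(markdown: str) -> dict[int, str]:
    """Map slide number -> Markdown for that slide."""
    # With one capture group, re.split returns
    # [text_before_first_marker, number, content, number, content, ...]
    parts = SLIDE_MARKER.split(markdown)
    return {
        int(parts[i]): parts[i + 1].strip()
        for i in range(1, len(parts), 2)
    }


def split_by_sheet(markdown: str) -> dict[str, str]:
    """Map sheet name -> Markdown for that sheet."""
    parts = SHEET_HEADING.split(markdown)
    return {
        parts[i].strip(): parts[i + 1].strip()
        for i in range(1, len(parts), 2)
    }
```

Anything before the first marker is dropped, and duplicate slide numbers or sheet names would overwrite each other, so treat this as a starting point only.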
This would be extremely useful. We currently use Azure Document Intelligence, which can provide markdown for PDF files by page, but we want to adopt markitdown to support more file formats (pdf, pptx, docx, xlsx).
For those still looking for page-wise markdown extraction: the library that markitdown is based on has this feature.