Pagewise Markdown output

Open bahtman opened this issue 1 year ago • 3 comments

Hi guys,

Love the work. In our current approach we convert each document to a list of markdown strings, one per page. This is really useful in RAG for providing citations and enriching the chunks during completion.
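
To make the use case concrete, here is a minimal sketch of how a per-page list can feed citations. `Chunk` and `chunk_pages` are hypothetical names for illustration, not part of markitdown, and the fixed-size character split is just a stand-in for whatever chunking a real pipeline uses.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int  # 1-based page number, usable as a citation


def chunk_pages(pages: list[str], max_chars: int = 500) -> list[Chunk]:
    """Split each page's markdown into chunks that remember their page.

    `pages` is assumed to hold one markdown string per page, in order.
    """
    chunks = []
    for page_number, page_md in enumerate(pages, start=1):
        # Naive fixed-size split; empty pages produce no chunks.
        for start in range(0, len(page_md), max_chars):
            chunks.append(Chunk(text=page_md[start:start + max_chars], page=page_number))
    return chunks
```

Because chunks never cross a page boundary, every retrieved chunk can be cited with a single page number.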

I'm unsure how feasible it is for PDF, docx and other "paginated" content.

bahtman, Dec 18 '24 09:12

The .pptx extractor includes <!-- Slide number: X --> markers, which I use to split the markdown by page. For Excel, there is a ## Sheet name title for each sheet.
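
For anyone wanting to do the same, the splitting can be sketched roughly like this. It assumes the markers look exactly as quoted above (one `<!-- Slide number: X -->` comment per slide, one `## ` heading per sheet); `split_by_slide` and `split_by_sheet` are hypothetical helper names, not markitdown functions.

```python
import re

# Marker format as quoted above; the tolerance for extra whitespace is an assumption.
SLIDE_MARKER = re.compile(r"<!--\s*Slide number:\s*(\d+)\s*-->")


def split_by_slide(markdown: str) -> dict[int, str]:
    """Map slide number -> markdown that follows that slide's marker."""
    # With one capture group, split() returns [preamble, num, body, num, body, ...].
    parts = SLIDE_MARKER.split(markdown)
    return {int(num): body.strip() for num, body in zip(parts[1::2], parts[2::2])}


def split_by_sheet(markdown: str) -> dict[str, str]:
    """Map sheet name -> markdown under each level-2 heading."""
    parts = re.split(r"^## (.+)$", markdown, flags=re.MULTILINE)
    return {name.strip(): body.strip() for name, body in zip(parts[1::2], parts[2::2])}
```

Note that the sheet split treats every line starting with `## ` as a sheet boundary, so it would misfire if cell content ever produced such a line.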

I agree that it would be nice to have paged output for the other formats (pdf and docx, for me), and also an API that returns the output paged directly, so I don't need to split it myself based on markers.

jonasb, Feb 14 '25 08:02

This would be extremely useful. We currently use Azure Document Intelligence, which can provide markdown for PDF files by page, but we want to adopt markitdown to support more file formats (pdf, pptx, docx, xlsx).

tsitsimis, Sep 11 '25 06:09

For those still looking for page-wise markdown extraction: the library that markitdown is based on has this feature.

emcf, Oct 01 '25 14:10