
Unable to filter out headers, footers, or page numbers in PDF-to-Markdown conversion

Open privtools opened this issue 6 months ago • 3 comments

MarkItDown doesn't automatically detect or filter out headers, footers, or page numbers, so they end up inserted into the body text during PDF-to-Markdown conversion.

privtools avatar Aug 26 '25 11:08 privtools

I see no one has commented on this yet. I’m curious whether this is even feasible to implement reliably. Since PDFs don’t explicitly mark headers or footers, it might be difficult to filter them out without accidentally removing valid content. Does anyone think there’s a safe approach that could work consistently for this project?

ParadigmPacket avatar Aug 27 '25 15:08 ParadigmPacket

I tried it on a PDF of a programming book, and this is what the table of contents looked like:

Image

And some code snippets:

Image Image

In short, MarkItDown cannot extract the structure of the book. The result needs a huge amount of preprocessing before it is useful for an LLM.

No magic here. It is good at text extraction and very fast, but the preprocessing does not come for free. We still need to get our hands dirty.

haims-3607 avatar Sep 27 '25 02:09 haims-3607

OK, so I think I may have finally implemented a solution to this issue. I'll upload my file to the repo on my profile and open a new pull request at some point. This version uses pdfplumber instead of pdfminer.

The original output is on the left and the new version is on the right.

Image
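For illustration only, and not the code from the pull request mentioned above: one way pdfplumber can help is by cropping each page to a body region before extracting text, which leaves out anything printed in the top and bottom margins. This sketch assumes headers and footers sit within a fixed fraction of the page height. The names `body_bbox`, `extract_body_text`, and `margin_frac` are hypothetical.

```python
def body_bbox(page_bbox, margin_frac=0.08):
    """Shrink a (x0, top, x1, bottom) box by `margin_frac` of its height
    at the top and at the bottom. The default is an arbitrary assumption."""
    x0, top, x1, bottom = page_bbox
    margin = (bottom - top) * margin_frac
    return (x0, top + margin, x1, bottom - margin)


def extract_body_text(path, margin_frac=0.08):
    """Extract text from each page of the PDF at `path`, skipping margins."""
    # Imported here so that body_bbox can be used without pdfplumber installed.
    import pdfplumber

    chunks = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Keep only the region between the assumed header and footer bands.
            body = page.crop(body_bbox(page.bbox, margin_frac))
            # Guard against pages with no extractable text.
            chunks.append(body.extract_text() or "")
    return "\n\n".join(chunks)
```

A fixed margin is a blunt instrument. It can cut off body text on pages with unusual layouts, so the fraction would have to be checked against the actual document.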

ParadigmPacket avatar Nov 19 '25 17:11 ParadigmPacket