MarkItDown: unable to filter out headers, footers, or page numbers in PDF-to-Markdown conversion

MarkItDown doesn't automatically distinguish or filter out headers, footers, or page numbers, so they end up inserted into the body text during PDF-to-Markdown conversion.

Aug 26 '25 11:08 privtools

I see no one has commented on this yet. I’m curious whether this is even feasible to implement reliably. Since PDFs don’t explicitly mark headers or footers, it might be difficult to filter them out without accidentally removing valid content. Does anyone think there’s a safe approach that could work consistently for this project?
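For what it's worth, one common heuristic (not something MarkItDown does, as far as this thread shows) is to look only at the first and last non-empty line of each page and drop lines that recur on most pages once digits are masked. Below is a minimal sketch assuming the text is already available per page as a list of strings; `strip_repeated_margins`, `edge_lines`, and `min_share` are hypothetical names and values picked for illustration.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    # Mask digits so "Page 3" and "Page 4" count as the same line.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_margins(pages, edge_lines=1, min_share=0.6):
    """Drop edge lines that recur across pages (hypothetical helper).

    pages: list of per-page text strings.
    edge_lines: how many non-empty lines at the top and bottom to inspect.
    min_share: fraction of pages a line must appear on to be removed.
    """
    split = [page.splitlines() for page in pages]

    def edge_indices(lines):
        idx = [i for i, text in enumerate(lines) if text.strip()]
        return set(idx[:edge_lines] + idx[-edge_lines:])

    # Count on how many pages each normalized edge line appears.
    counts = Counter()
    for lines in split:
        counts.update({normalize(lines[i]) for i in edge_indices(lines)})

    # Require at least two pages so a single page is never stripped.
    threshold = max(2, int(min_share * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split:
        edges = edge_indices(lines)
        kept = [
            text
            for i, text in enumerate(lines)
            if not (i in edges and normalize(text) in repeated)
        ]
        cleaned.append("\n".join(kept))
    return cleaned
```

This also shows why the concern about removing valid content is real: a body line that happens to sit at a page edge and looks the same on most pages (for example a bare number) would be dropped too, so the thresholds need tuning per document.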

Aug 27 '25 15:08 ParadigmPacket

I tried it on a PDF of a programming book, and this is what the table of contents looked like:

And some code snippets:

In short, MarkItDown cannot extract the structure of the book. The result needs a huge amount of pre-processing before it is useful for an LLM.

No magic here. Good at text extraction. Very fast. No free pre-processing. We still need to get our hands dirty.

Sep 27 '25 02:09 haims-3607

OK, so I think I may have finally implemented a solution to this issue. I'll upload my file to the repo on my profile and open a new pull request at some point. This version uses pdfplumber instead of pdfminer.

The original output is on the left and the new version is on the right.
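The implementation itself isn't shown in this thread, so the following is not that code. It is only a rough sketch of one way pdfplumber can help with this problem: crop each page to a body region before extracting text, so anything in the top and bottom margins is skipped. The 8% margin fractions and the names `body_bbox` and `extract_body_text` are assumptions for illustration.

```python
def body_bbox(page_bbox, top_margin=0.08, bottom_margin=0.08):
    """Return (x0, top, x1, bottom) for the area between the margins.

    page_bbox: the page's (x0, top, x1, bottom) box, as in pdfplumber's page.bbox.
    top_margin / bottom_margin: fractions of the page height to skip (assumed values).
    """
    x0, top, x1, bottom = page_bbox
    height = bottom - top
    return (x0, top + height * top_margin, x1, bottom - height * bottom_margin)


def extract_body_text(pdf_path, top_margin=0.08, bottom_margin=0.08):
    """Extract text from each page, ignoring the top and bottom margins."""
    import pdfplumber  # third-party; imported here so body_bbox works without it

    texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            bbox = body_bbox(page.bbox, top_margin, bottom_margin)
            # crop() returns a page-like object limited to the bounding box.
            texts.append(page.crop(bbox).extract_text() or "")
    return texts
```

Fixed margins are simple but brittle: a document with tall headers, or with body text running close to the page edge, would need different fractions.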

Nov 19 '25 17:11 ParadigmPacket