Unable to filter out headers, footers, or page numbers in pdf to md conversion
MarkItDown doesn't distinguish or filter out headers, footers, or page numbers during PDF-to-Markdown conversion, so they end up inserted into the body text.
I see no one has commented on this yet. I’m curious whether this is even feasible to implement reliably. Since PDFs don’t explicitly mark headers or footers, it might be difficult to filter them out without accidentally removing valid content. Does anyone think there’s a safe approach that could work consistently for this project?
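For what it's worth, one common heuristic (not something MarkItDown does today, as far as this thread shows) is to treat a line as a header or footer only if it repeats near the top or bottom of most pages, after masking digits so that "Page 3" and "Page 4" compare equal. Here is a minimal sketch of that idea. The function name `strip_repeated_edges` and the parameters `edge_lines` and `min_ratio` are made up for illustration, and it assumes the text has already been split into per-page lists of lines.

```python
import math
import re
from collections import Counter


def _normalize(line: str) -> str:
    # Mask digit runs so changing page numbers still compare equal.
    return re.sub(r"\d+", "#", line.strip()).lower()


def strip_repeated_edges(pages, edge_lines=2, min_ratio=0.6):
    """Drop lines that repeat at the top/bottom of most pages.

    pages: list of pages, each page a list of text lines (hypothetical input
    shape; a real converter would have to produce this split first).
    """
    if len(pages) < 2:
        # With a single page there is nothing to compare against.
        return [list(page) for page in pages]

    # Count on how many pages each normalized edge line appears.
    counts = Counter()
    for page in pages:
        edge = page[:edge_lines] + page[-edge_lines:]
        counts.update({_normalize(l) for l in edge if l.strip()})

    threshold = max(2, math.ceil(min_ratio * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = []
        for i, line in enumerate(page):
            is_edge = i < edge_lines or i >= len(page) - edge_lines
            if is_edge and _normalize(line) in repeated:
                continue  # looks like a running header/footer
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

Because only edge lines that recur across pages are removed, body text is left alone unless it happens to repeat at a page boundary, which is exactly the failure mode the question above worries about. The threshold would need tuning per document.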
I tried it on a PDF of a programming book, and this is what the table of contents looked like:
And some code snippets:
In short,
MarkItDown cannot extract the structure of the book.
The result will need a huge amount of preprocessing to be useful for an LLM.
No magic here. Good at text extraction. Very fast. No free preprocessing. We still need to get our hands dirty.
OK, so I think I may have finally implemented a solution to this issue. I'll upload my file to the repo on my profile and open a new pull request at some point. This version uses pdfplumber instead of pdfminer.
The original output is on the left and the new version is on the right.
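For readers wondering how pdfplumber can help here: the thread doesn't show the actual implementation, so the following is only a guess at one possible approach, not the code from the pull request. pdfplumber lets you crop a page to a bounding box before extracting text, so a fixed top and bottom margin can be cut off. The helper names `body_bbox` and `extract_body_text` and the 50-point default margins are assumptions for illustration.

```python
def body_bbox(page_bbox, top_margin=50.0, bottom_margin=50.0):
    """Shrink a page box (x0, top, x1, bottom) by fixed margins.

    The margin values are illustrative guesses; real documents vary.
    """
    x0, top, x1, bottom = page_bbox
    if top + top_margin >= bottom - bottom_margin:
        raise ValueError("margins leave no room for body text")
    return (x0, top + top_margin, x1, bottom - bottom_margin)


def extract_body_text(pdf_path, top_margin=50.0, bottom_margin=50.0):
    """Extract text from each page with the top and bottom strips cropped off."""
    import pdfplumber  # third-party; imported lazily so body_bbox works without it

    parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # pdfplumber measures `top` and `bottom` from the top of the page.
            bbox = body_bbox(page.bbox, top_margin, bottom_margin)
            parts.append(page.crop(bbox).extract_text() or "")
    return "\n\n".join(parts)
```

Fixed margins are simple but blunt: they will clip real content on pages where the body runs into the margin, and miss headers that sit lower than expected, so this would likely need to be combined with some repetition check.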