markitdown Preserve tables, titles (structure) of PDF documents

I was trying to convert an SEC 10-K (PDF) as an example. Running it like this:

python -m markitdown ~/Downloads/4.General\ Electric\ Company.pdf > ge.md

And I see that the resulting Markdown doesn't include tables, titles, etc. Pretty much, no structure. It makes it suboptimal for analyzing documents downstream (e.g. to pass a particular table to LLM and ask it to calculate something, or at least extract a specific value).

If this is out of scope for this tool, feel free to close the ticket. I wonder if there are simple-to-use tools that can do that. I tried the unstructured lib, but it requires quite a complicated setup to extract tables, and it seems their open-source version is becoming less maintained (?).

Documents:

4.General Electric Company.pdf

Dec 15 '24 19:12 shcheklein

PDFs are an absolute nightmare specifically for this reason, and we've had issues with reading order from direct file extraction because the order things are rendered in is not necessarily the order they're present in the file either.

We've just gone down the route of using Azure's Document Intelligence system which does a much better job of extracting structural information (even tables!), but of course that's not an offline option and does lead to some cost / latency considerations.

Dec 16 '24 14:12 ckpearson

Just tested this as well with a very simple table and found the output no better than raw pdf > txt.

Btw why is the output txt and not markdown? I don't see titles or section headings with ###. The output seems plain text.

Dec 16 '24 17:12 dipam7

Hi, you might want to take a look at https://github.com/Filimoa/open-parse .

Dec 17 '24 09:12 jvhgit

@shcheklein You can use docling (https://github.com/DS4SD/docling), it is designed for that!

Dec 17 '24 16:12 PeterStaar-IBM

This would be a great feature. PyMuPDF is the only one I found that does it almost perfectly: https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/index.html

Docling manages to do it, but not as well.

Dec 19 '24 16:12 gabrielstuff

> This would be a great feature. PyMuPDF is the only one I found that does it almost perfectly: https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/index.html
>
> Docling manages to do it, but not as well.

@gabrielstuff What do you mean? Running the following command,

docling --from pdf --to html --image-export-mode embedded /Users/taa/Downloads/4.General.Electric.Company.pdf

I get the following (few screenshots)

Dec 20 '24 12:12 PeterStaar-IBM

@PeterStaar-IBM, I will share some PDFs. What I mean is that the tables rendered by docling were not at the level of the ones from pymupdf. The tables from markitdown are simply bad.

Jan 17 '25 13:01 gabrielstuff

> We've just gone down the route of using Azure's Document Intelligence

@ckpearson Do you have code you can share for this?

Jan 30 '25 19:01 deepdive101

> > We've just gone down the route of using Azure's Document Intelligence
>
> @ckpearson Do you have code you can share for this?

So everything you need should be here: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0&tabs=rest%2Csample-code

We're using the Layout extraction model as it's the most broadly capable, and does the structure analysis, though General goes as far as things like checkboxes etc.

They've got SDKs available for different languages, but ultimately it's also just a REST API call away.
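For anyone who wants to see roughly what the REST route looks like, here is a minimal sketch using only the Python standard library. It assumes the v4.0 REST shape (the `documentintelligence/documentModels/prebuilt-layout:analyze` path, the `2024-11-30` API version, and a `urlSource` body); please check those against the docs linked above. The function names and the example endpoint are made up for illustration.

```python
import json
import urllib.request

# Assumption: v4.0 GA API version; verify against the linked documentation.
API_VERSION = "2024-11-30"


def build_layout_request(endpoint, key, pdf_url):
    """Build (but do not send) the POST that starts a Layout analysis."""
    url = (
        f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
        f"prebuilt-layout:analyze?api-version={API_VERSION}"
    )
    # The document is passed by URL here; the service fetches it itself.
    body = json.dumps({"urlSource": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
    )


def start_analysis(request):
    """Send the request and return the URL to poll for the result."""
    with urllib.request.urlopen(request) as response:
        # The call is asynchronous: the result is fetched later from this URL.
        return response.headers["Operation-Location"]
```

The analysis runs asynchronously, so after `start_analysis` you would poll the returned URL with a GET (same key header) until the operation reports success, then read the result from the response JSON.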

Some interesting gotchas we ran into that might save you some time:

- The raw text (named Content on the response objects) DOES NOT preserve visual new-lines
  - The top-level document text has new-lines between paragraphs, but not within them; this is to account for the fact that a new line might be legitimate, or it might be due to content wrapping
  - They leave the determination of that up to you, but you can find all of the lines that make up an element in the lines collection on a page
- Content is indexed using spans, where the index and length represent positions within the top-level document content string
- Content can theoretically span multiple pages (paragraph, table etc), which is why boundingRegions can contain multiple entries over different pages
  - In practice I've not seen this happen yet, but you might want to account for it if you're interested in the coordinates of items in the document
- Tables are numbered globally in sequence (1, 2, etc)
  - Tables that span multiple pages WILL get seen as separate tables; you'll need to do your own work to glue them back together in some sort of post-process step
- Text content inside tables is ALSO present as paragraphs in the paragraphs collection, so if you don't want to process it independently of the table it's contained within, you'll want to do some processing to exclude content that is contained within other content
  - We used an interval tree for this, where we add the content items based on their smallest span index and total length
    - Then finding if a paragraph is inside a table is a pretty easy query of "find me tables where this paragraph's span overlaps"
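
To make the span idea concrete, here is a small sketch of the "drop paragraphs that live inside a table" step. It is not the interval tree described above: it swaps in a sorted list plus `bisect`, which is enough if tables never overlap each other. It also checks containment rather than any overlap. The dict shape (`spans` with `offset` and `length`) is an assumption modelled on the JSON response, and the sample data is made up.

```python
from bisect import bisect_right


def span_bounds(item):
    """Return (start, end) covering all spans of an item; end is exclusive."""
    starts = [s["offset"] for s in item["spans"]]
    ends = [s["offset"] + s["length"] for s in item["spans"]]
    return min(starts), max(ends)


def text_of(content, item):
    """Slice an item's text out of the top-level content string."""
    return "".join(
        content[s["offset"]:s["offset"] + s["length"]] for s in item["spans"]
    )


def paragraphs_outside_tables(paragraphs, tables):
    """Keep only paragraphs whose span is not contained in any table span."""
    # Assumption: table intervals do not overlap each other.
    intervals = sorted(span_bounds(t) for t in tables)
    starts = [start for start, _ in intervals]
    kept = []
    for paragraph in paragraphs:
        p_start, p_end = span_bounds(paragraph)
        # Rightmost table that starts at or before the paragraph start.
        i = bisect_right(starts, p_start) - 1
        inside = i >= 0 and p_end <= intervals[i][1]
        if not inside:
            kept.append(paragraph)
    return kept


# Made-up sample: one table covering "A B", with each cell also a paragraph.
content = "Intro text\nA B\nOutro"
paragraphs = [
    {"spans": [{"offset": 0, "length": 10}]},   # "Intro text"
    {"spans": [{"offset": 11, "length": 1}]},   # "A" (table cell)
    {"spans": [{"offset": 13, "length": 1}]},   # "B" (table cell)
    {"spans": [{"offset": 15, "length": 5}]},   # "Outro"
]
tables = [{"spans": [{"offset": 11, "length": 3}]}]

body_text = [text_of(content, p) for p in paragraphs_outside_tables(paragraphs, tables)]
print(body_text)
```

If tables can nest or overlap in your documents, a real interval tree (as we used) is the safer choice; the sorted-list version only looks at one candidate table per paragraph.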

Happy hacking!

Feb 05 '25 10:02 ckpearson