Preserve tables, titles (structure) of PDF documents

Open shcheklein opened this issue 1 year ago • 9 comments

I was trying to convert an SEC 10-K (PDF) as an example, running it like this:

python -m markitdown ~/Downloads/4.General\ Electric\ Company.pdf > ge.md

And I see that the resulting Markdown doesn't include tables, titles, etc. Pretty much no structure. That makes it suboptimal for analyzing documents downstream (e.g. to pass a particular table to an LLM and ask it to calculate something, or at least extract a specific value).

If it is out of scope for this tool, feel free to close the ticket. I wonder if there are simple-to-use tools that can do that. I was trying the unstructured lib, but it requires quite a complicated setup to extract tables, and it seems their open-source version is becoming less maintained (?).

Documents:

4.General Electric Company.pdf

shcheklein avatar Dec 15 '24 19:12 shcheklein

PDFs are an absolute nightmare specifically for this reason, and we've had issues with reading order from direct file extraction because the order things are rendered in is not necessarily the order they're present in the file either.

We've just gone down the route of using Azure's Document Intelligence system which does a much better job of extracting structural information (even tables!), but of course that's not an offline option and does lead to some cost / latency considerations.

ckpearson avatar Dec 16 '24 14:12 ckpearson

Just tested this as well with a very simple table and found the output no better than a raw PDF-to-text conversion.

Btw why is the output txt and not markdown? I don't see titles or section headings with ###. The output seems plain text.

dipam7 avatar Dec 16 '24 17:12 dipam7

Hi, you might want to take a look at https://github.com/Filimoa/open-parse .

jvhgit avatar Dec 17 '24 09:12 jvhgit

@shcheklein You can use docling (https://github.com/DS4SD/docling), it is designed for that!

PeterStaar-IBM avatar Dec 17 '24 16:12 PeterStaar-IBM

This would be a great feature. PyMuPDF is the only one I found doing it almost perfectly: https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/index.html

Docling manages to do it, but it is not that good.

gabrielstuff avatar Dec 19 '24 16:12 gabrielstuff

@gabrielstuff What do you mean? Running the following command,

docling --from pdf --to html --image-export-mode embedded /Users/taa/Downloads/4.General.Electric.Company.pdf

I get the following (a few screenshots):

Image

Image

PeterStaar-IBM avatar Dec 20 '24 12:12 PeterStaar-IBM

@PeterStaar-IBM, I will share some PDFs. What I mean is that the tables rendered by docling were not at the level of the ones from PyMuPDF. The tables from markitdown are simply bad.

gabrielstuff avatar Jan 17 '25 13:01 gabrielstuff

We've just gone down the route of using Azure's Document Intelligence

@ckpearson Do you have code you can share for this?

deepdive101 avatar Jan 30 '25 19:01 deepdive101

So everything you need should be here: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0&tabs=rest%2Csample-code

We're using the Layout extraction model as it's the most broadly capable, and does the structure analysis, though General goes as far as things like checkboxes etc.

They've got SDKs available for different languages, but ultimately it's also just a REST API call away.
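To make the "just a REST API call away" point concrete, here is a minimal sketch that only builds the request for the Layout model with Python's standard library and does not send it. The route, the `2024-11-30` API version, the `Ocp-Apim-Subscription-Key` header and the `urlSource` body field are assumptions based on the v4.0 REST reference linked above and should be verified there; `build_layout_request`, the endpoint, the key and the PDF URL are hypothetical placeholders.

```python
import json
import urllib.request

# Assumed GA version for Document Intelligence v4.0; check the linked docs.
API_VERSION = "2024-11-30"


def build_layout_request(endpoint: str, key: str, pdf_url: str) -> urllib.request.Request:
    """Build (but do not send) an analyze request for the prebuilt Layout model."""
    url = (
        f"{endpoint.rstrip('/')}/documentintelligence/documentModels/"
        f"prebuilt-layout:analyze?api-version={API_VERSION}"
    )
    # The service is asked to fetch the PDF itself from a reachable URL.
    body = json.dumps({"urlSource": pdf_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
    )
```

Per the same reference, sending this with `urllib.request.urlopen(req)` should return `202 Accepted` with an `Operation-Location` header, which is then polled with a GET until the analysis result is ready.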

Some interesting gotchas we ran into that might save you some time:

  • The raw text (named Content on the response objects) DOES NOT preserve visual new-lines
    • The top-level document text has new-lines between paragraphs, but not within them; this is to account for the fact that a new line might be legitimate, or it might be due to content wrapping
    • They leave the determination of that up to you, but you can find all of the lines that make up an element in the lines collection on a page
  • Content is indexed using spans where the index and length represent positions within the top-level document content string
  • Content theoretically can span multiple pages (paragraph, table etc) which is why boundingRegions can contain multiple entries over different pages
    • In practice though I've not seen this happen yet, but you might want to account for it in case you're interested in the coordinates of items in the document
  • Tables are numbered globally in sequence (1, 2, etc)
    • Tables that span multiple pages WILL get seen as separate tables, so you'll need to do your own work to glue them back together in some sort of post-process step
  • Text content inside tables ALSO appears as paragraphs in the paragraphs collection, so if you don't want to process those paragraphs independently of the table that contains them, you'll want to do some processing to exclude content that is contained within other content
    • We used an interval tree for this, where we add the content items based on their smallest span index and total length
      • Then finding if a paragraph is inside a table is a pretty easy query of "find me tables where this paragraph's span overlaps"
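The span-based gotchas above can be sketched roughly as follows. This is not the interval-tree code described in the last bullets: it swaps in a sorted list plus `bisect`, which is enough as long as table spans do not overlap each other. It assumes the REST JSON shape, where each paragraph and table has a `spans` list of `{"offset", "length"}` objects pointing into the top-level `content` string, and that offsets count Unicode code points so Python slicing lines up. `span_bounds`, `text_of` and `paragraphs_outside_tables` are hypothetical helper names.

```python
from bisect import bisect_right


def span_bounds(item: dict) -> tuple[int, int]:
    """Return (start, end) covering all spans of a paragraph or table."""
    starts = [s["offset"] for s in item["spans"]]
    ends = [s["offset"] + s["length"] for s in item["spans"]]
    return min(starts), max(ends)


def text_of(result: dict, item: dict) -> str:
    """Slice the top-level content string using the item's spans."""
    content = result["content"]
    return "".join(
        content[s["offset"]: s["offset"] + s["length"]] for s in item["spans"]
    )


def paragraphs_outside_tables(result: dict) -> list[dict]:
    """Keep only paragraphs whose span is not contained in any table span."""
    table_bounds = sorted(span_bounds(t) for t in result.get("tables", []))
    table_starts = [start for start, _ in table_bounds]
    kept = []
    for para in result.get("paragraphs", []):
        start, end = span_bounds(para)
        # Candidate: the last table that starts at or before this paragraph.
        i = bisect_right(table_starts, start) - 1
        inside = i >= 0 and end <= table_bounds[i][1]
        if not inside:
            kept.append(para)
    return kept


# Tiny hand-made example: one table covering "A B" in the content string.
sample = {
    "content": "Intro\nA B\nOutro",
    "paragraphs": [
        {"spans": [{"offset": 0, "length": 5}]},   # "Intro"
        {"spans": [{"offset": 6, "length": 1}]},   # "A" (table cell)
        {"spans": [{"offset": 8, "length": 1}]},   # "B" (table cell)
        {"spans": [{"offset": 10, "length": 5}]},  # "Outro"
    ],
    "tables": [{"spans": [{"offset": 6, "length": 3}]}],
}
standalone = [text_of(sample, p) for p in paragraphs_outside_tables(sample)]
print(standalone)  # ['Intro', 'Outro']
```

Collapsing a table's spans into a single (start, end) range is a simplification; a table with non-contiguous spans would need each span checked separately, which is where a real interval tree becomes the more natural fit.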

Happy hacking!

ckpearson avatar Feb 05 '25 10:02 ckpearson