Preserve tables, titles (structure) of PDF documents
I was trying to convert a SEC 10-K (PDF) as an example, running it like this:
python -m markitdown ~/Downloads/4.General\ Electric\ Company.pdf > ge.md
And I see that the resulting Markdown doesn't include tables, titles, etc. Pretty much no structure at all. That makes it suboptimal for analyzing documents downstream (e.g. passing a particular table to an LLM and asking it to calculate something, or at least extract a specific value).
If it is out of scope for this tool, feel free to close the ticket. I wonder if there are simple-to-use tools that can do that. I was trying the unstructured lib, but it also requires quite a complicated setup to extract tables, and it seems their open-source version is becoming less maintained (?).
PDFs are an absolute nightmare specifically for this reason, and we've had issues with reading order from direct file extraction because the order things are rendered in is not necessarily the order they're present in the file either.
We've just gone down the route of using Azure's Document Intelligence system which does a much better job of extracting structural information (even tables!), but of course that's not an offline option and does lead to some cost / latency considerations.
Just tested this as well with a very simple table and found the output no better than a raw pdf-to-txt conversion.
Btw, why is the output plain text and not Markdown? I don't see titles or section headings marked with ###.
Hi, you might want to take a look at https://github.com/Filimoa/open-parse .
@shcheklein You can use docling (https://github.com/DS4SD/docling), it is designed for that!
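For anyone who wants to try it quickly, here is a minimal sketch of the docling Python API as I understand it from its README (the input path is just an example):

```python
from docling.document_converter import DocumentConverter

# Convert the PDF and export the structured result (headings, tables) as Markdown.
converter = DocumentConverter()
result = converter.convert("4.General Electric Company.pdf")  # example path
print(result.document.export_to_markdown())
```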
This would be a great feature. PyMuPDF is the only one I found that does it almost perfectly: https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/index.html
Docling manages to do it, but not as well.
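In case it helps, a minimal sketch of the pymupdf4llm entry point per the linked docs (the file name is just an example):

```python
import pymupdf4llm  # pip install pymupdf4llm

# Convert the PDF to Markdown, preserving headings and tables where detected.
md_text = pymupdf4llm.to_markdown("4.General Electric Company.pdf")  # example path

with open("ge.md", "w", encoding="utf-8") as f:
    f.write(md_text)
```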
@gabrielstuff What do you mean? Running the following command,
docling --from pdf --to html --image-export-mode embedded /Users/taa/Downloads/4.General.Electric.Company.pdf
I get the following (a few screenshots):
@PeterStaar-IBM, I will share some PDFs. What I mean is that the tables rendered by docling were not at the level of the ones from pymupdf.
The tables from markitdown are simply bad.
We've just gone down the route of using Azure's Document Intelligence
@ckpearson Do you have code you can share for this?
So everything you need should be here: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0&tabs=rest%2Csample-code
We're using the Layout extraction model as it's the most broadly capable and does the structure analysis, though the General model goes as far as things like checkboxes, etc.
They've got SDKs available for different languages, but ultimately it's also just a REST API call away.
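Not our production code, but here's a minimal sketch of calling the Layout model from Python, assuming the azure-ai-formrecognizer package (the newer azure-ai-documentintelligence SDK has an equivalent call); the endpoint, key, and file path are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Placeholders: use your own Document Intelligence endpoint and key.
client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

# Analyze a local PDF with the prebuilt Layout model.
with open("report.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

# result.content is the full extracted text; tables and paragraphs carry
# spans that index back into it.
for table in result.tables:
    print(f"Table with {table.row_count} rows x {table.column_count} columns")
for paragraph in result.paragraphs:
    print(paragraph.content)
```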
Some interesting gotchas we ran into that might save you some time:
- The raw text (named content on the response objects) DOES NOT preserve visual new-lines
  - The top-level document text has new-lines between paragraphs, but not within them; this is to account for the fact that a new line might be legitimate, or it might be due to content wrapping
  - They leave the determination of that up to you, but you can find all of the lines that make up an element in the lines collection on a page
- Content is indexed using spans, where the index and length represent positions within the top-level document content string
- Content theoretically can span multiple pages (paragraph, table, etc.), which is why boundingRegions can contain multiple entries over different pages
  - In practice I've not seen this happen yet, but you might want to account for it in case you're interested in the coordinates of items in the document
- Tables are numbered globally in sequence (1, 2, etc.)
- Tables that span multiple pages WILL get seen as separate tables; you'll need to do your own work to glue them back together in some sort of post-process step
- Text content inside tables is ALSO present as paragraphs in the paragraphs collection, so if you don't want to process it independently of the table it's contained within, you'll want to do some processing to exclude content that is contained within other content
  - We used an interval tree for this, where we add the content items based on their smallest span index and total length (see the sketch after this list)
  - Then finding if a paragraph is inside a table is a pretty easy query of "find me tables where this paragraph's span overlaps"
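Here's a rough sketch of that interval-tree idea, assuming the intervaltree package and the result object from the layout call above (attribute names follow the Python SDK, where spans expose offset/length rather than index/length):

```python
from intervaltree import IntervalTree  # pip install intervaltree

def span_range(item):
    # An element may have several spans; take the smallest offset and the total length.
    start = min(s.offset for s in item.spans)
    return start, start + sum(s.length for s in item.spans)

# Index every table by the character range it covers in result.content.
table_tree = IntervalTree()
for i, table in enumerate(result.tables):
    start, end = span_range(table)
    table_tree[start:end] = i

# A paragraph whose span overlaps no table is standalone text.
standalone = [p for p in result.paragraphs
              if not table_tree.overlap(*span_range(p))]
```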
Happy hacking!