Tables in pdf files are not converted properly
I converted a PDF file with lots of tables to Markdown. I had expected that markitdown would handle tables gracefully. For example, the following table:
Should be converted into markdown like so:
| Register name | Description | Offset Address |
|---------------|---------------------------------|----------------|
| FMC_ACCTRL | Flash access control register | 0x00 |
| FMC_KEY | Flash key register | 0x04 |
| FMC_OPTKEY | Flash option key register | 0x08 |
| FMC_STS | Flash state register | 0x0C |
| FMC_CTRL | Flash control register | 0x10 |
| FMC_OPTCTRL | Flash option control register | 0x14 |
However, what I get from markitdown is this:
Register address mapping
Table 14 FMC Register Address Mapping
Register name
Description
Offset Address
FMC_ACCTRL
Flash access control register
FMC_KEY
Flash key register
FMC_OPTKEY
Flash option key register
FMC_STS
FMC_CTRL
Flash state register
Flash control register
FMC_OPTCTRL
Flash option control register
0x00
0x04
0x08
0x0C
0x10
0x14
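The number 3.6 in the title is gone. But what's worse, the entire table is spread out: every cell lands on its own line, and the offset addresses are listed together at the end instead of next to their registers.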
+1
+1
+1
+1
+1
Hi all, please consider using the thumbs-up emoji reaction instead. No one likes to see a page of +1 comments.
It could use a pdf2md library like pdfplumber. Can anyone point me to other ppt2md repositories?
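For anyone who wants to try the pdfplumber route in the meantime, here is a minimal sketch. It is not part of markitdown; `rows_to_markdown` and `pdf_tables_to_markdown` are hypothetical helper names made up for this comment. It relies on pdfplumber's `page.extract_tables()`, which returns each detected table as a list of rows of cell strings (or `None` for empty cells). How well the detection works depends heavily on the PDF, so treat this as a starting point rather than a fix.

```python
from typing import List, Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render table rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    # Empty cells come back as None; also flatten line breaks and escape pipes.
    cleaned = [
        [(cell or "").replace("\n", " ").replace("|", "\\|").strip() for cell in row]
        for row in rows
    ]
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in cleaned)
    cleaned = [row + [""] * (width - len(row)) for row in cleaned]
    header, *body = cleaned
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["---"] * width) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def pdf_tables_to_markdown(path: str) -> List[str]:
    """Return one Markdown table per table that pdfplumber detects in the PDF."""
    # Imported here so rows_to_markdown stays usable without pdfplumber installed.
    import pdfplumber

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                tables.append(rows_to_markdown(rows))
    return tables
```

Calling `pdf_tables_to_markdown("datasheet.pdf")` (the file name is just a placeholder) would give a list of Markdown tables that can be spliced into the text output by hand.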
Can we get an official answer to this? Are tables in PDFs not supported? Are they going to be supported? What is the best alternative? Thanks
If you look at the PDF converter code, it's just a simple call to pdfminer: https://github.com/microsoft/markitdown/blob/3fcd48cdfc651cbf508071c8d2fb7d82aeb075de/packages/markitdown/src/markitdown/converters/_pdf_converter.py#L76-L78
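To illustrate what that amounts to, here is a rough sketch of the same kind of call outside markitdown. `pdf_to_plain_text` is a hypothetical name, and this is an approximation of the idea, not a copy of the converter. pdfminer.six's `extract_text` accepts a path or a file-like object and returns a single string, so any table structure is already lost by the time markitdown sees the result.

```python
import io


def pdf_to_plain_text(pdf_bytes: bytes) -> str:
    """Return all text in the PDF as one string, with no table structure."""
    # Imported lazily so this module loads even without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    # extract_text accepts a file path or a binary file-like object.
    return extract_text(io.BytesIO(pdf_bytes))
```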
pdfminer is not as powerful as some other PDF-to-text Python packages, such as PyMuPDF.
Unless Microsoft uses a different PDF-to-text package or develops their own, I don't think tables in PDFs will be supported for a long while. (FWIW, Microsoft can't just use PyMuPDF because of its AGPL license.)