
Tables in pdf files are not converted properly

Open kristofmulier opened this issue 11 months ago • 9 comments

user_manual.pdf

I converted a PDF file with lots of tables to Markdown. I had expected that markitdown would handle tables gracefully. For example, the following table:

[Image: screenshot of the table as it appears in the PDF]

Should be converted into markdown like so:

| Register name | Description                     | Offset Address |
|---------------|---------------------------------|----------------|
| FMC_ACCTRL    | Flash access control register   | 0x00           |
| FMC_KEY       | Flash key register              | 0x04           |
| FMC_OPTKEY    | Flash option key register       | 0x08           |
| FMC_STS       | Flash state register            | 0x0C           |
| FMC_CTRL      | Flash control register          | 0x10           |
| FMC_OPTCTRL   | Flash option control register   | 0x14           |

However, what I get from markitdown is this:

  Register address mapping

Table 14 FMC Register Address Mapping

Register name

Description

Offset Address

FMC_ACCTRL

Flash access control register

FMC_KEY

Flash key register

FMC_OPTKEY

Flash option key register

FMC_STS

FMC_CTRL

Flash state register

Flash control register

FMC_OPTCTRL

Flash option control register

0x00

0x04

0x08

0x0C

0x10

0x14

The number 3.6 in the title is gone. But what's worse: the entire table is spread out.

kristofmulier avatar Jan 18 '25 16:01 kristofmulier

+1

numairmansur avatar Jan 19 '25 11:01 numairmansur

+1

morinooji avatar Jan 23 '25 01:01 morinooji

+1

alekshandra avatar Jan 23 '25 11:01 alekshandra

+1

fd2533 avatar Jan 23 '25 11:01 fd2533

+1

NV-MichaelBauer avatar Jan 24 '25 12:01 NV-MichaelBauer

Hi all, please consider using the thumbs-up emoji reaction instead. No one likes to see a page of +1 comments.

schlichtanders avatar Jan 31 '25 11:01 schlichtanders

It could use a PDF-to-Markdown library like pdfplumber. Can anyone point me to other repositories for converting PPT to Markdown?

Dimitri666 avatar Feb 25 '25 01:02 Dimitri666
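
For anyone wanting to try the pdfplumber route mentioned above, here is a minimal sketch, not something markitdown does today. It assumes pdfplumber is installed and that its `page.extract_tables()` finds the table on the page (detection depends on the PDF's ruling lines and layout). The helper names `rows_to_markdown` and `pdf_tables_to_markdown` are made up for this example.

```python
from typing import List, Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    # Empty cells may arrive as None, and cells can contain line breaks or pipes.
    cleaned = [
        [(cell or "").replace("\n", " ").replace("|", "\\|").strip() for cell in row]
        for row in rows
    ]
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in cleaned)
    cleaned = [row + [""] * (width - len(row)) for row in cleaned]
    header, *body = cleaned
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["---"] * width) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def pdf_tables_to_markdown(path: str) -> List[str]:
    """Return one Markdown table per table that pdfplumber detects in the PDF."""
    import pdfplumber  # third-party; imported here so rows_to_markdown works without it

    tables = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                tables.append(rows_to_markdown(table))
    return tables
```

Calling `pdf_tables_to_markdown("user_manual.pdf")` would return the detected tables as Markdown strings; whether Table 14 is detected correctly in this particular manual is untested.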

Can we get an official answer on this: are tables in PDFs not supported? Are they going to be supported? What is the best alternative? Thanks

avalanche-tm avatar Mar 26 '25 19:03 avalanche-tm

If you look at the PDF converter code, it's just a simple call to pdfminer: https://github.com/microsoft/markitdown/blob/3fcd48cdfc651cbf508071c8d2fb7d82aeb075de/packages/markitdown/src/markitdown/converters/_pdf_converter.py#L76-L78

pdfminer's not as powerful as other PDF-to-text Python packages like PyMuPDF, for example.

Unless Microsoft uses a different PDF-to-text package or develops their own, I don't think tables in PDFs will be supported for a long while. (FWIW, Microsoft can't just use PyMuPDF because of its AGPL license.)

airknits avatar Apr 08 '25 01:04 airknits
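
To illustrate the point about pdfminer above: plain text extraction returns text in layout-analysis order and drops the cell positions, which is presumably why the columns come out one after another. A rough workaround, sketched below under several assumptions, is to read each text line's bounding box with pdfminer's `extract_pages` and regroup lines that share a vertical position into rows. `group_into_rows`, `rows_from_pdf`, and the 3-point tolerance are hypothetical choices for this example, and the heuristic will misfire on multi-line cells or closely spaced rows.

```python
from typing import Iterable, Iterator, List, Tuple

# (x0, y_top, text) in PDF points; a hypothetical shape chosen for this sketch.
Fragment = Tuple[float, float, str]


def group_into_rows(fragments: Iterable[Fragment], tol: float = 3.0) -> List[List[str]]:
    """Group text fragments into rows by vertical position, cells ordered left to right."""
    rows: List[Tuple[float, List[Tuple[float, str]]]] = []
    # PDF y coordinates grow upward, so sort descending to walk the page top-down.
    for x0, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= tol:
            rows[-1][1].append((x0, text))  # same row as the previous fragment
        else:
            rows.append((y, [(x0, text)]))  # start a new row
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def rows_from_pdf(path: str) -> Iterator[List[List[str]]]:
    """Yield the regrouped rows for each page of the PDF."""
    from pdfminer.high_level import extract_pages  # third-party (pdfminer.six)
    from pdfminer.layout import LTTextContainer, LTTextLine

    for page in extract_pages(path):
        fragments = []
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            # A text box can span several cells, so work at line granularity.
            for line in element:
                if isinstance(line, LTTextLine):
                    x0, _, _, y_top = line.bbox
                    fragments.append((x0, y_top, line.get_text().strip()))
        yield group_into_rows(fragments)
```

This only recovers row and column order; it does not detect where a table starts or ends, so it is closer to a diagnostic aid than a replacement for a real table extractor.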