msexcel_backend.py doesn’t parse complex Excel tables properly.

Open rafaelsanchezsouza opened this issue 11 months ago • 2 comments

Bug

When attempting to open an Excel document with complex tables, Docling fails to extract the tables correctly.

Steps to reproduce

from docling.document_converter import DocumentConverter

source = "./excel-tests.xlsx"  # local path to the attached workbook
converter = DocumentConverter()
result = converter.convert(source)
# Print the converted document as Markdown; the actual output is shown below.
print(result.document.export_to_markdown())

excel-tests.xlsx

Output

HIGH VOLTAGE SWITCHBOARD
DATA SHEET
MODEL-000
2016
Page 1 of 10
Package no.:
13156456
Doc. no.:
144564
Rev.
A
Power system
1
=(A11+1)
=(A12+1)
=(A13+1)
=(A14+1)
=(A15+1)
Construction
7
=(A18+1)
=(A19+1)
Environmental conditions
=(A20+1)
=(A22+1)
=(A23+1)
=(A24+1)
Arc test
=(A25+1)
Notes
1
2
Rated system voltage
Rated system frequency
No. of phases
System earthing
Earth fault current
Control voltage supply
kV : 130 (131 Um, 132 AC, 133 BIL)
Hz : 134
: 3
: Solidly Earthed
: 135 kA
: 2 x 136V AC UPS 1 x 137V AC normal
A : 135 kA
: 2 x 136V AC UPS 1 x 137V AC normal
Metal-enclosed partition
VT for cable discharging
Voltage and Current measurement
No
Low Power Instrument Transformers
Hazardous area classification
Ambient temp.
Location
Humidity
: Non hazardous
: Min. -5, max. +40
: Indoor
: 100
Converted to 110VDC Converted to 110VDC Converted to 110VDC Converted to 110VDC Converted to 110VDC
Arc test (type test) Arc test (type test) Arc test (type test) Arc test (type test) Arc test (type test)
None None None None None

Docling version

Docling version: 2.17.0
Docling Core version: 2.16.0
Docling IBM Models version: 3.3.0
Docling Parse version: 3.1.2
Python: cpython-310 (3.10.7)
Platform: Windows-10-10.0.19045-SP0

Python version

Python 3.10.7

Final Considerations

I understand that the table is complex, so I would like to know what requirements an Excel document must meet to work with Docling. While digging into the code, I noticed this:

  • split_text_and_number - the regex does not trim the strings returned by match.groups()
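
To illustrate the trimming point, here is a minimal sketch of what a trimmed split could look like. The regex pattern and the function body are assumptions written for illustration only; they are not copied from the Docling source, and only the function name is taken from the observation above.

```python
import re

# Hypothetical pattern (an assumption, not Docling's actual regex):
# either text followed by a number, or a number followed by text.
_TEXT_NUMBER = re.compile(r"^(\D+?)(\d+)$|^(\d+)(\D+)$")


def split_text_and_number(value: str) -> list[str]:
    """Split a string into its text part and number part, trimming both."""
    stripped = value.strip()
    match = _TEXT_NUMBER.match(stripped)
    if match is None:
        # No text/number boundary found: return the trimmed input unchanged.
        return [stripped]
    # Drop the groups of the alternative that did not match (None),
    # then trim whitespace from each remaining group.
    return [group.strip() for group in match.groups() if group]
```

Without the final `strip()` call, an input such as "Heading 1" would yield "Heading " with a trailing space, which may be the kind of untrimmed result described above.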

Hope it helps,

Let me know if you need more information.

Have a nice day!

rafaelsanchezsouza, Jan 29 '25 18:01

Hey, I am interested in helping with this. Is the assignee still actively working on it? If not, I'd be happy to take it on or collaborate.

Ra5hidIslam, Nov 04 '25 07:11

@Ra5hidIslam you are very welcome to help with this issue. You could start with the current version on main. Some aspects, like the formulas, have already been addressed, but the challenge of complex table layouts remains.

ceberam, Nov 04 '25 08:11