Bug
When attempting to open an Excel document with complex tables, Docling fails to extract the tables correctly.
Steps to reproduce
from docling.document_converter import DocumentConverter
source = "./excel-tests.xlsx"  # local path or URL of the document
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # prints the Markdown shown under "Output"
excel-tests.xlsx
Output
| HIGH VOLTAGE SWITCHBOARD |
| DATA SHEET |
| MODEL-000 |
| 2016 |
| Page 1 of 10 |
| Power system |
| 1 |
| =(A11+1) |
| =(A12+1) |
| =(A13+1) |
| =(A14+1) |
| =(A15+1) |
| Construction |
| 7 |
| =(A18+1) |
| =(A19+1) |
| Environmental conditions |
| =(A20+1) |
| =(A22+1) |
| =(A23+1) |
| =(A24+1) |
| Arc test |
| =(A25+1) |
| Notes |
| 1 |
| 2 |
| Rated system voltage |
| Rated system frequency |
| No. of phases |
| System earthing |
| Earth fault current |
| Control voltage supply |
| kV |
: |
130 (131 Um, 132 AC, 133 BIL) |
| Hz |
: |
134 |
|
: |
3 |
| : |
Solidly Earthed |
| : |
135 kA |
| : |
2 x 136V AC UPS 1 x 137V AC normal |
| A |
: |
135 kA |
|
: |
2 x 136V AC UPS 1 x 137V AC normal |
| Metal-enclosed partition |
| VT for cable discharging |
| Voltage and Current measurement |
| No |
| Low Power Instrument Transformers |
| Hazardous area classification |
| Ambient temp. |
| Location |
| Humidity |
| : |
Non hazardous |
| : |
Min. -5, max. +40 |
| : |
Indoor |
| : |
100 |
| Converted to 110VDC |
Converted to 110VDC |
Converted to 110VDC |
Converted to 110VDC |
Converted to 110VDC |
| Arc test (type test) |
Arc test (type test) |
Arc test (type test) |
Arc test (type test) |
Arc test (type test) |
| None |
None |
None |
None |
None |
Docling version
Docling version: 2.17.0
Docling Core version: 2.16.0
Docling IBM Models version: 3.3.0
Docling Parse version: 3.1.2
Python: cpython-310 (3.10.7)
Platform: Windows-10-10.0.19045-SP0
Python version
Python 3.10.7
Final Considerations
I understand that the table is complex, so I would like to know what requirements an Excel document must meet to work with Docling. While digging into the code, I noticed this:
- split_text_and_number: the regex there does not trim the values returned by match.groups()
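To illustrate what I mean by trimming, here is a minimal standalone sketch. It is not Docling's code: the pattern and the function name split_text_and_number_trimmed are my own assumptions, written only to show how stripping each captured group removes stray whitespace such as the trailing space in "Heading ".

```python
import re

# Hypothetical pattern (an assumption, not copied from Docling):
# text followed by digits, or digits followed by text.
_TEXT_NUMBER = re.compile(r"^(\D+)(\d+)$|^(\d+)(\D+)$")


def split_text_and_number_trimmed(value: str) -> list[str]:
    """Split a string into its text and number parts, stripping whitespace."""
    cleaned = value.strip()
    match = _TEXT_NUMBER.match(cleaned)
    if match is None:
        # Nothing to split: return the cleaned input as a single part.
        return [cleaned]
    # match.groups() contains None for the alternative that did not match,
    # so drop those entries and strip the remaining parts.
    return [part.strip() for part in match.groups() if part is not None]
```

With this sketch, "Heading 1" gives ["Heading", "1"] and "135 kA" gives ["135", "kA"], while a string with no digits such as "kV" comes back unchanged as ["kV"].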
Hope it helps,
Let me know if you need more information.
Have a nice day!
Hey, I am interested in helping with this. Is the assignee still actively working on it? If not, I'd be happy to take it on or collaborate.
@Ra5hidIslam you are very welcome to help with this issue. You could start with the current version on main. Some aspects, like the formulas, have already been addressed, but the challenge of complex table layouts is still there.