docling
Processing of TOC objects in Word documents (DOCX) fails
Bug
We are seeing odd behavior when processing a TOC (table of contents) in a Word document: the conversion fails silently. No error is raised, but the resulting document is missing the content that was originally in the TOC. What we have tried:
- Using OCR does not help: the TOC content is still omitted.
- Exporting to PDF (using Word) and then using docling to convert to markdown works as expected with no content omissions.
Steps to reproduce
Use the attached minimal example DOCX file and run:
docling sample.docx
This produces the attached Markdown file, in which the TOC content is missing.
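To help narrow this down, here is a minimal diagnostic sketch that uses only the Python standard library and does not call docling at all. It assumes, without having verified it against sample.docx, that Word wrapped the TOC in a structured document tag (`w:sdt`) inside `word/document.xml`, which is how Word-generated TOCs are typically stored. The function name `sdt_paragraph_texts` is made up for this illustration. Whether this wrapping is actually what causes the omission is unconfirmed.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def sdt_paragraph_texts(docx_path):
    """Return the text of every non-empty paragraph nested inside a w:sdt element.

    Hypothetical helper for diagnosis only; docx_path may be a file path
    or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    texts = []
    seen = set()  # avoid counting a paragraph twice when w:sdt elements are nested
    for sdt in root.iter(W + "sdt"):
        for para in sdt.iter(W + "p"):
            if id(para) in seen:
                continue
            seen.add(id(para))
            # Concatenate all text runs of the paragraph
            text = "".join(t.text or "" for t in para.iter(W + "t"))
            if text.strip():
                texts.append(text)
    return texts
```

Running `sdt_paragraph_texts("sample.docx")` and checking which of the returned strings are absent from sample.md would show whether the missing content is exactly the content held in `w:sdt` blocks.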
Docling version
Docling version: 2.13.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0
Python version
Python 3.11.11
Note: all shared samples are publicly available documents.
Attachments: sample.md, sample.docx