
Processing of TOC objects in Word documents (DOCX) fails

Open w0o opened this issue 1 year ago • 0 comments

Bug

We are seeing odd behavior: converting a Word document that contains a table of contents (TOC) completes without any error, but the resulting document is missing the content that was originally in the TOC. What we have tried:

  • Enabling OCR does not help; the TOC content is still omitted.
  • Exporting to PDF (using Word) and then converting that PDF to Markdown with docling works as expected, with no content omitted.

Steps to reproduce

Use the attached minimal example DOCX file and run:

docling sample.docx

This produces the attached Markdown file, in which the TOC content is missing.
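To see exactly which text goes missing, a small standard-library sketch like the one below can list the TOC entries stored in the DOCX. It assumes the TOC is saved the way Word usually saves an automatic table of contents: a `w:sdt` content control whose `w:docPartGallery` is "Table of Contents". Whether sample.docx uses that layout, and whether that layout is related to the omission, has not been confirmed. The helper name `toc_lines_from_docx` is hypothetical and is not part of docling.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used in word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def toc_lines_from_docx(path):
    """Return the text of each non-empty paragraph inside a Word TOC content control.

    Assumption: the TOC is a w:sdt block tagged with the
    "Table of Contents" document part gallery.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    lines = []
    for sdt in root.iter(f"{{{W}}}sdt"):
        gallery = sdt.find("w:sdtPr/w:docPartObj/w:docPartGallery", NS)
        if gallery is None or gallery.get(f"{{{W}}}val") != "Table of Contents":
            continue  # some other content control, not a TOC
        for para in sdt.iterfind("w:sdtContent//w:p", NS):
            parts = []
            # Only look inside runs, so tab-stop definitions in w:pPr are ignored.
            for run in para.iter(f"{{{W}}}r"):
                for child in run:
                    if child.tag == f"{{{W}}}t":
                        parts.append(child.text or "")
                    elif child.tag == f"{{{W}}}tab":
                        parts.append("\t")
            text = "".join(parts).strip()
            if text:
                lines.append(text)
    return lines
```

Calling `toc_lines_from_docx("sample.docx")` and comparing the returned entries with the generated Markdown shows which lines were dropped. An empty list would suggest the TOC is stored some other way in this file.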

Docling version

  • Docling version: 2.13.0
  • Docling Core version: 2.12.1
  • Docling IBM Models version: 3.1.0
  • Docling Parse version: 3.0.0

Python version

Python 3.11.11

Note: all shared samples are publicly available documents. Attachments: sample.md, sample.docx

w0o, Dec 19 '24 03:12