What is the meaning of `missing-text`?

Open Belval opened this issue 1 year ago • 1 comments

Question

When exporting docx documents as text, I always seem to get some missing-text in the output. I was not able to find this string in the project repository, python-docx, or documentation.

Snippet:

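from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter

doc_converter = DocumentConverter(allowed_formats=[InputFormat.DOCX])
conv_res = doc_converter.convert(input_doc_path)  # input_doc_path: path to the .docx file
print(conv_res.document.export_to_text())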

Output:

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

<missing-text>

Documents:

  • Complete failure, all text is "missing-text": doc.docx
  • Partial failure, only some of the text is "missing-text": doc2.docx

Both documents are public.

What causes missing-text? What should be my mental model for it when processing documents?

Thanks!

Belval, Dec 02 '24 21:12

@Belval, thanks for sharing the sample documents, I will check this!

maxmnemonic, Dec 03 '24 12:12

@Belval here is a draft PR to fix the missing-text issue: https://github.com/DS4SD/docling/pull/528. The issue appeared when text elements were embedded in nested tables; moreover, the text sits in a different XML tag than expected. I suspect those documents were exported from an application other than MS Word.

Regarding the doc2 example: this PR seems to fix it completely. Regarding doc: the text is now preserved, but the entire document is a complex nest of nested tables, which creates other artifacts in the final output. For this file we recommend converting it to PDF and then running it through Docling for better-quality results.
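Since nested tables are the trigger described above, a quick way to check whether a given DOCX is affected is to measure how deeply its tables nest. The following is a minimal standard-library sketch, not part of Docling: the helper names `max_table_depth` and `docx_max_table_depth` are hypothetical, and it assumes the main document part lives at the conventional `word/document.xml` path and uses the standard WordprocessingML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace; tables are <w:tbl> elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
TBL_TAG = f"{{{W_NS}}}tbl"


def max_table_depth(document_xml: bytes) -> int:
    """Return the deepest level of <w:tbl> nesting (0 means no tables)."""
    root = ET.fromstring(document_xml)

    def depth(elem, current):
        # Entering a table increases the nesting level for its subtree.
        if elem.tag == TBL_TAG:
            current += 1
        deepest = current
        for child in elem:
            deepest = max(deepest, depth(child, current))
        return deepest

    return depth(root, 0)


def docx_max_table_depth(path: str) -> int:
    """Hypothetical convenience wrapper: read the main part of a .docx file.

    Assumes the main document part is stored at word/document.xml.
    """
    with zipfile.ZipFile(path) as archive:
        return max_table_depth(archive.read("word/document.xml"))
```

A depth of 2 or more means at least one table sits inside another table's cell, which is the layout where converting to PDF first may give cleaner results.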

maxmnemonic, Dec 06 '24 08:12

Here's the missing-text dummy: https://github.com/DS4SD/docling-core/blob/f464be5521a92a21cb312b2d7f68489487b63b10/docling_core/types/doc/document.py#L2140
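To gauge how badly a conversion is affected, one option is to measure how much of the exported text consists of the placeholder. This is a rough sketch rather than a Docling API: `missing_text_ratio` is a hypothetical helper, and it assumes the placeholder appears as `<missing-text>` on its own line, as in the output pasted in the question.

```python
# Placeholder string as it appears in the exported text shown in the question.
PLACEHOLDER = "<missing-text>"


def missing_text_ratio(exported_text: str) -> float:
    """Return the fraction of non-empty lines that are exactly the placeholder."""
    lines = [line.strip() for line in exported_text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    missing = sum(1 for line in lines if line == PLACEHOLDER)
    return missing / len(lines)
```

Calling it on the result of `conv_res.document.export_to_text()` gives a ratio close to 1.0 for a complete failure like doc.docx and a smaller nonzero value for a partial failure like doc2.docx.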

sanmai-NL, Dec 20 '24 09:12