What is the meaning of `missing-text`?
Question
When exporting docx documents as text, I always seem to get some missing-text in the output. I was not able to find this string in the project repository, python-docx, or documentation.
Snippet:
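from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter

doc_converter = DocumentConverter(allowed_formats=[InputFormat.DOCX])
conv_res = doc_converter.convert(input_doc_path)  # input_doc_path: path to the .docx file
print(conv_res.document.export_to_text())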
Output:
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
<missing-text>
Documents:
- Complete failure, all text is "missing-text": doc.docx
- Partial failure, only some of the text is "missing-text": doc2.docx
Both documents are public.
What causes missing-text? What should be my mental model for it when processing documents?
Thanks!
@Belval, thanks for sharing the sample documents, I will check this!
@Belval here is a draft PR to fix the missing-text issue: https://github.com/DS4SD/docling/pull/528 The issue appeared when text elements were embedded in nested tables; moreover, the text sits in a different XML tag than expected. I suspect those documents were exported from applications other than MS Word.
Regarding the doc2 example: this PR seems to fix it completely. Regarding doc: the text is now preserved, but the entire document is a complex nest of nested tables, which creates other artifacts in the final output. For this file we recommend converting it to PDF and then running it through Docling for better-quality results.
Here's the missing-text dummy: https://github.com/DS4SD/docling-core/blob/f464be5521a92a21cb312b2d7f68489487b63b10/docling_core/types/doc/document.py#L2140
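In the meantime, one way to spot affected documents is to measure how much of the export consists of the placeholder. This is only a minimal sketch: it assumes the placeholder always appears as the literal string `<missing-text>` on its own line, as in the output above, and `placeholder_ratio` is a hypothetical helper, not part of Docling.

```python
# Literal placeholder string as it appears in the export output above.
PLACEHOLDER = "<missing-text>"


def placeholder_ratio(exported_text: str) -> float:
    """Return the fraction of non-empty lines that are exactly the placeholder."""
    lines = [line.strip() for line in exported_text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    missing = sum(1 for line in lines if line == PLACEHOLDER)
    return missing / len(lines)
```

Passing the result of `conv_res.document.export_to_text()` to this helper gives 1.0 for a complete failure like doc.docx and a value between 0 and 1 for a partial failure like doc2.docx; the threshold at which a document should be re-run as PDF is a judgment call.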