unstructured DOCX doesn't recognize listitems within textbox

Describe the bug: The DOCX partitioner doesn't recognize list items inside a textbox element of a Word document.

To Reproduce: Partition a sample Word file that contains two kinds of list items. As the screenshot shows, only the "plain" list items are recognized; those within the textbox are missing from the elements list.

list_in_texbox_list-item-missing.docx

Screenshots

May 28 '24 11:05 veredmm

This is the file content:

May 28 '24 11:05 veredmm

@scanny - Any thoughts on this one?

May 28 '24 12:05 MthwRobinson

We currently extract run text from inline text-box shapes along with the rest of the text in the paragraph to which the textbox is anchored. This behavior was added in this PR: https://github.com/Unstructured-IO/unstructured/pull/2510

We could potentially do this differently such that both inline and floating text-boxes were separately partitioned, which would recognize list-items inside them each as a separate element.

Background

A run is an inline element (think HTML <span>) within a paragraph. Paragraph text can only appear within a run. The text of a paragraph is the concatenation of the text in each of its runs.
A (DOCX) shape contains one of several possible "graphical" items, including a textbox, but can also be an image, chart, SmartArt, etc.
A textbox shape contains one or more paragraphs. In general each non-empty paragraph in a document gives rise to a single element in the output.
A shape can either be inline or floating. An inline shape is treated like a large character and flows with the text of the paragraph. A floating shape is anchored to a paragraph but can be moved to an arbitrary position and text flows around it.
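
To make the run/paragraph relationship concrete, here is a minimal sketch that uses only the Python standard library rather than python-docx or unstructured. The XML snippet and the helper names `run_text` and `paragraph_text` are illustrative assumptions, not part of either library.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# A hand-written paragraph with three runs; the middle one is bold.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">The quick </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>brown</w:t></w:r>
  <w:r><w:t xml:space="preserve"> fox</w:t></w:r>
</w:p>
"""


def run_text(run):
    """Text of one run: the concatenation of its w:t children."""
    return "".join(t.text or "" for t in run.findall("w:t", NS))


def paragraph_text(p):
    """Text of a paragraph: the concatenation of the text of its runs."""
    return "".join(run_text(r) for r in p.findall("w:r", NS))


p = ET.fromstring(PARAGRAPH_XML)
print(paragraph_text(p))  # The quick brown fox
```

Note that each run already carries whatever whitespace it needs, which is why the runs are joined with an empty string.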

The approach taken in the prior PR was to include any text in an inline textbox with the text of the paragraph in which it occurs.

Because this only applies to inline shapes and the example here is floating, the "Aaa.." text does not appear in the partitioning output.
If it were an inline textbox, all the text would appear together in a single element, like text="AaaBbbccc" because this is the concatenation of all the runs in the textbox and the paragraph it occurs in is otherwise empty.
If we wanted to partition textbox shapes more precisely, we would need to add a subpartitioner that considered the paragraphs in the text-box separately, each giving rise to their own element. In this case the paragraphs are identified as list items so the textbox would produce three ListItem elements that would occur immediately after the element containing the other text in the paragraph (empty in this particular case).
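
As a rough illustration of what such a sub-partitioner might do, the sketch below walks the paragraphs of a text-box body and emits one element per non-empty paragraph. It is not unstructured's implementation: the function name `partition_textbox`, the `(kind, text)` tuples, and the rule "a paragraph with w:numPr is a list item" are all assumptions made for the example, and real documents may mark list items through a paragraph style instead.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Simplified text-box body. In a real document w:txbxContent sits several
# levels below w:drawing (inside wp:anchor or wp:inline).
TEXTBOX_XML = f"""
<w:txbxContent xmlns:w="{W}">
  <w:p>
    <w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>
    <w:r><w:t>Aaa</w:t></w:r>
  </w:p>
  <w:p>
    <w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>
    <w:r><w:t>Bbb</w:t></w:r>
  </w:p>
  <w:p>
    <w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>
    <w:r><w:t>ccc</w:t></w:r>
  </w:p>
  <w:p/>
</w:txbxContent>
"""


def partition_textbox(txbx_content):
    """Return one (kind, text) tuple per non-empty paragraph in the text box."""
    elements = []
    for p in txbx_content.findall("w:p", NS):
        text = "".join(t.text or "" for t in p.findall("w:r/w:t", NS))
        if not text.strip():
            continue  # empty paragraphs produce no element
        # Assumption: numbering properties on the paragraph mark a list item.
        is_list_item = p.find("w:pPr/w:numPr", NS) is not None
        elements.append(("ListItem" if is_list_item else "Text", text))
    return elements


print(partition_textbox(ET.fromstring(TEXTBOX_XML)))
```

For the example file in this issue, that would yield three list-item entries instead of a single "AaaBbbccc" string.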

May 28 '24 18:05 scanny

@scanny - Any suggestions for workarounds in case I have many documents with this structure (floating shapes with a lot of text inside)?

May 29 '24 06:05 veredmm

@veredmm Not off the top of my head, no. A general-case solution is pretty disruptive to the current partitioner structure (so wouldn't be easy to monkey-patch or whatever) and would require deep domain knowledge of the DOCX format.

That said, if you changed this line: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L441

from:

"w:r | w:hyperlink | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"

to:

"w:r"
" | w:hyperlink"
" | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
" | w:r/descendant::wp:anchor[ancestor::w:drawing][1]//w:r"

(note wp:anchor (floating shape) in addition to wp:inline (inline shape))

Then the text inside the textboxes would at least appear in the output.

It wouldn't be pretty because paragraph text would be joined together without a space in between, like:

the quick brown fox

jumped over the lazy dog

would appear as: "whatever text came beforethe quick brown foxjumped over the lazy dogwhatever text came after"

So you'd have to judge whether the benefit was worth the trouble.
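
The effect of adding wp:anchor can be approximated with the standard library alone. The sketch below does not run the XPath expression itself (ElementTree does not support unions or the descendant/ancestor axes); it imitates the selection by hand, ignores w:hyperlink, and uses a simplified hand-written paragraph. The function name and its `shape_tags` parameter are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
NS = {"w": W, "wp": WP}

# A paragraph whose second run anchors a floating text box with two paragraphs.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}" xmlns:wp="{WP}">
  <w:r><w:t>whatever text came before</w:t></w:r>
  <w:r>
    <w:drawing>
      <wp:anchor>
        <w:txbxContent>
          <w:p><w:r><w:t>the quick brown fox</w:t></w:r></w:p>
          <w:p><w:r><w:t>jumped over the lazy dog</w:t></w:r></w:p>
        </w:txbxContent>
      </wp:anchor>
    </w:drawing>
  </w:r>
  <w:r><w:t>whatever text came after</w:t></w:r>
</w:p>
"""


def paragraph_text(p, shape_tags):
    """Join run text, including runs nested in the given shape kinds."""
    parts = []
    for run in p.findall("w:r", NS):
        # Text held directly by this run.
        parts.extend(t.text or "" for t in run.findall("w:t", NS))
        # Text of runs nested inside the selected shapes of this run.
        for tag in shape_tags:
            for shape in run.findall(f"w:drawing/{tag}", NS):
                parts.extend(t.text or "" for t in shape.findall(".//w:r/w:t", NS))
    return "".join(parts)


p = ET.fromstring(PARAGRAPH_XML)
print(paragraph_text(p, ("wp:inline",)))              # floating text is dropped
print(paragraph_text(p, ("wp:inline", "wp:anchor")))  # floating text is included
```

With only wp:inline selected, the text-box text is absent; with wp:anchor added it appears, but run together with its neighbours as described above.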

May 29 '24 20:05 scanny

@scanny Thanks! But why not just add a space in the join statement to keep the words from running together:

    text = " ".join(
        e.text
        for e in paragraph._p.xpath(
            "w:r"
            " | w:hyperlink"
            " | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
            " | w:r/descendant::wp:anchor[ancestor::w:drawing][1]//w:r"
        )
    )

May 30 '24 10:05 veredmm

@veredmm Could do, but that would place an extra space between regular runs, which already contain whatever space they need.
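
A small plain-Python illustration of the trade-off follows. The second half shows one possible middle ground, joining runs within a paragraph with an empty string and joining paragraphs with a space; that option is not proposed in this thread, and `join_blocks` is a hypothetical helper, not something in unstructured.

```python
# Regular runs already carry their own whitespace.
run_texts = ["The quick ", "brown", " fox"]

print(repr("".join(run_texts)))   # 'The quick brown fox'
print(repr(" ".join(run_texts)))  # 'The quick  brown  fox' (doubled spaces)


def join_blocks(blocks):
    """Join runs inside each paragraph tightly, and paragraphs with a space."""
    return " ".join("".join(runs) for runs in blocks)


# Each inner list holds the run texts of one paragraph.
blocks = [
    ["whatever text came before"],
    ["the quick ", "brown", " fox"],
    ["jumped over the lazy dog"],
]
print(join_blocks(blocks))
```

This requires knowing which runs belong to which paragraph, information the flat XPath result does not preserve, so it would need more than a one-line change.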

May 30 '24 17:05 scanny

@scanny

I recently encountered this issue when working with sections that contain text within textboxes or similar nested structures (headers and footers require accessing more files from the tree apart from the main one, though). After some investigation, I found a workaround that has helped me address this issue. To my knowledge, the problem seems to stem from how the content is iterated in sections.

Please refer to the file: python-docx/src/docx/section.py

Replace the original content (lines 162-163):

    for element in self._sectPr.iter_inner_content():
        yield (Paragraph(element, self) if isinstance(element, CT_P) else Table(element, self))

With these lines (the import goes near the top of the file, at line 13):

    from docx.oxml.table import CT_Tbl
    ...
    for element in self._sectPr.iter_inner_content():
        for item in element.iter():
            if isinstance(item, CT_P):
                yield Paragraph(item, self)
            elif isinstance(item, CT_Tbl):
                yield Table(item, self)
                break  # prevent iterating through its nested structure and duplicate content

While I know it's not an optimal solution, I wanted to share it in the hope that it might be of some use or provide a foundation for further development. It certainly needs further testing and a broader perspective. I would be more than happy to provide any additional details if needed.
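
For readers who would rather not patch an installed library, a similar "visit nested paragraphs too" idea can be sketched with the standard library only. This is an assumption-laden sketch, not part of python-docx or unstructured: `iter_paragraph_texts` is a hypothetical helper, it reads only word/document.xml (not headers or footers), it flattens table cells into plain paragraphs, and it skips mc:Fallback subtrees on the assumption that they repeat the text-box content already found under mc:Choice.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"
P_TAG = f"{{{W}}}p"
T_TAG = f"{{{W}}}t"
FALLBACK_TAG = f"{{{MC}}}Fallback"


def _walk(element):
    """Yield descendants in document order, skipping mc:Fallback subtrees."""
    for child in element:
        if child.tag == FALLBACK_TAG:
            continue
        yield child
        yield from _walk(child)


def _own_text(p):
    """Text of paragraph p, excluding paragraphs nested inside it."""
    parts = []

    def collect(el):
        for child in el:
            if child.tag in (P_TAG, FALLBACK_TAG):
                continue  # nested paragraphs are reported separately
            if child.tag == T_TAG:
                parts.append(child.text or "")
            collect(child)

    collect(p)
    return "".join(parts)


def iter_paragraph_texts(docx_file):
    """Yield the text of every non-empty paragraph, text boxes included."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    for el in _walk(root):
        if el.tag == P_TAG:
            text = _own_text(el)
            if text:
                yield text
```

Calling `list(iter_paragraph_texts("list_in_texbox_list-item-missing.docx"))` on the attachment from this issue would be one way to check whether the text-box paragraphs come through; the result is plain strings, so list-item detection would still have to be layered on top.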

Thank you for your continuous efforts in maintaining and improving open-source software.

Aug 27 '24 07:08 carmonajca