DOCX doesn't recognize listitems within textbox
Describe the bug
DOCX partitioning doesn't recognize list items within a textbox element of a Word document.
To Reproduce
Use a sample Word file with two kinds of list items. You can see in the screenshot that only the "plain" list items are recognized; those within the textbox are missing from the elements list.
list_in_texbox_list-item-missing.docx
Screenshots
This is the file content:
@scanny - Any thoughts on this one?
We currently extract run text from inline text-box shapes along with the rest of the text in the paragraph to which the textbox is anchored. This behavior was added in this PR: https://github.com/Unstructured-IO/unstructured/pull/2510
We could potentially do this differently such that both inline and floating text-boxes were separately partitioned, which would recognize list-items inside them each as a separate element.
Background
- A run is an inline element (think HTML <span>) within a paragraph. Paragraph text can only appear within a run. The text of a paragraph is the concatenation of the text in each of its runs.
- A (DOCX) shape contains one of several possible "graphical" items, including a textbox, but can also be an image, chart, SmartArt, etc.
- A textbox shape contains one or more paragraphs. In general each non-empty paragraph in a document gives rise to a single element in the output.
- A shape can either be inline or floating. An inline shape is treated like a large character and flows with the text of the paragraph. A floating shape is anchored to a paragraph but can be moved to an arbitrary position and text flows around it.
The approach taken in the prior PR was to include any text in an inline textbox with the text of the paragraph in which it occurs.
- Because this only applies to inline shapes and the example here is floating, the "Aaa.." text does not appear in the partitioning output.
- If it were an inline textbox, all the text would appear together in a single element, like text="AaaBbbccc" because this is the concatenation of all the runs in the textbox and the paragraph it occurs in is otherwise empty.
- If we wanted to partition textbox shapes more precisely, we would need to add a subpartitioner that considered the paragraphs in the text-box separately, each giving rise to their own element. In this case the paragraphs are identified as list items so the textbox would produce three ListItem elements that would occur immediately after the element containing the other text in the paragraph (empty in this particular case).
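To make the difference concrete, here is a minimal sketch of the "one element per text-box paragraph" idea. It is not the unstructured partitioner: it uses only the standard library's xml.etree.ElementTree on a hand-written, simplified fragment (the graphic wrapper elements that normally sit between wp:anchor and w:txbxContent are omitted), and the helper names paragraph_text and textbox_paragraph_texts are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
NS = {"w": W, "wp": WP}

# Simplified, hand-written fragment (an assumption for illustration): a host
# paragraph with one regular run and one floating (wp:anchor) text-box that
# holds three paragraphs. Real documents nest w:txbxContent more deeply.
FRAGMENT = f"""
<w:p xmlns:w="{W}" xmlns:wp="{WP}">
  <w:r><w:t>Host text</w:t></w:r>
  <w:r>
    <w:drawing>
      <wp:anchor>
        <w:txbxContent>
          <w:p><w:r><w:t>Aaa</w:t></w:r></w:p>
          <w:p><w:r><w:t>Bbb</w:t></w:r></w:p>
          <w:p><w:r><w:t>ccc</w:t></w:r></w:p>
        </w:txbxContent>
      </wp:anchor>
    </w:drawing>
  </w:r>
</w:p>
"""


def paragraph_text(p):
    """Concatenate the text of runs that are direct children of paragraph `p`."""
    return "".join(t.text or "" for t in p.findall("w:r/w:t", NS))


def textbox_paragraph_texts(host_p):
    """Return one string per non-empty paragraph inside any text-box in `host_p`."""
    texts = []
    for p in host_p.findall(".//w:txbxContent/w:p", NS):
        text = paragraph_text(p)
        if text:
            texts.append(text)
    return texts


host = ET.fromstring(FRAGMENT)
host_text = paragraph_text(host)           # text of the host paragraph only
box_texts = textbox_paragraph_texts(host)  # one entry per text-box paragraph
```

Here host_text contains only the host paragraph's own run text, while box_texts keeps the three text-box paragraphs separate, which is what would let each become its own element instead of one concatenated string.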
@scanny - Any suggestions for workarounds in case I have many documents with this structure (floating shapes with a lot of text inside)?
@veredmm Not off the top of my head, no. A general-case solution is pretty disruptive to the current partitioner structure (so wouldn't be easy to monkey-patch or whatever) and would require deep domain knowledge of the DOCX format.
That said, if you changed this line: https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L441
from:
"w:r | w:hyperlink | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
to:
"w:r"
" | w:hyperlink"
" | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
" | w:r/descendant::wp:anchor[ancestor::w:drawing][1]//w:r"
(note wp:anchor (floating shape) in addition to wp:inline (inline shape))
Then the text inside the textboxes would at least appear in the output.
It wouldn't be pretty because paragraph text would be joined together without a space in between, like:
- the quick brown fox
- jumped over the lazy dog
would appear as: "whatever text came beforethe quick brown foxjumped over the lazy dogwhatever text came after"
So you'd have to judge whether the benefit was worth the trouble.
@scanny thanks! But I wonder, why not just add a space in the join statement to prevent the words from joining:
text = " ".join(
    e.text
    for e in paragraph._p.xpath(
        "w:r"
        " | w:hyperlink"
        " | w:r/descendant::wp:inline[ancestor::w:drawing][1]//w:r"
        " | w:r/descendant::wp:anchor[ancestor::w:drawing][1]//w:r"
    )
)
@veredmm Could do, but that would place an extra space between regular runs, which already contain whatever space they need.
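A small illustration of that trade-off, using made-up run texts rather than a real document. It assumes one word is split across two regular runs, which can happen when formatting changes mid-word; the variable names are hypothetical.

```python
# Hypothetical run texts. The first two come from regular runs in the host
# paragraph (one word split across two runs); the last two come from two
# separate paragraphs inside a text-box.
run_texts = ["The qu", "ick brown fox ", "jumped over", "the lazy dog"]

# Current behavior: runs are concatenated as-is, so regular runs read
# correctly but text-box paragraphs run together.
joined_without_separator = "".join(run_texts)

# Proposed behavior: a space between every run separates the text-box
# paragraphs but also splits the word and doubles an existing space.
joined_with_space = " ".join(run_texts)
```

With these inputs the first join yields "The quick brown fox jumped overthe lazy dog" and the second yields "The qu ick brown fox  jumped over the lazy dog", so neither separator is right for both kinds of run.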
@scanny
I recently encountered this issue when working with sections that contain text within textboxes or similar nested structures (headers and footers require accessing more files from the tree apart from the main one, though). After some investigation, I found a workaround that has helped me address this issue. To my knowledge, the problem seems to stem from how the content is iterated in sections.
Please refer to the file: python-docx/src/docx/section.py
Replace original content:
for element in self._sectPr.iter_inner_content():
    yield (Paragraph(element, self) if isinstance(element, CT_P) else Table(element, self))
With these lines:
from docx.oxml.table import CT_Tbl  # added alongside the existing imports at the top of the file
...
for element in self._sectPr.iter_inner_content():
    for item in element.iter():
        if isinstance(item, CT_P):
            yield Paragraph(item, self)
        elif isinstance(item, CT_Tbl):
            yield Table(item, self)
            break  # prevent iterating through its nested structure and duplicating content
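The effect of descending with iter() can be seen in a small standard-library sketch. It uses xml.etree.ElementTree, whose Element.iter() also yields an element followed by its descendants in document order, on simplified un-namespaced tags that stand in for the real elements; top_level_ids and nested_ids are hypothetical helpers, and the placement of break inside the table branch is an assumption about the intent of the snippet above.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a section body (an assumption for illustration):
# a paragraph hosting a text-box with two paragraphs, then a table whose
# cell contains a paragraph.
BODY = """
<body>
  <p id="host"><r><txbx><p id="in-box-1"/><p id="in-box-2"/></txbx></r></p>
  <tbl id="table"><tc><p id="in-cell"/></tc></tbl>
</body>
"""


def top_level_ids(body):
    """Original behavior: only direct children of the body are visited."""
    return [child.get("id") for child in body]


def nested_ids(body):
    """Workaround behavior: descend into each child, but stop at a table."""
    found = []
    for child in body:
        for item in child.iter():
            if item.tag == "p":
                found.append(item.get("id"))
            elif item.tag == "tbl":
                found.append(item.get("id"))
                break  # the table exposes its own cells; do not list them twice
    return found


body = ET.fromstring(BODY)
flat = top_level_ids(body)
deep = nested_ids(body)
```

In this sketch flat holds only the host paragraph and the table, while deep also picks up the two text-box paragraphs and still skips the paragraph inside the table cell.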
While I know it’s not an optimal solution, I wanted to share it with you in the hope that it might be of some use or provide a foundation for further development. I humbly believe that it must require further testing and a broader perspective. I would be more than happy to provide any additional details if needed.
Thank you for your continuous efforts in maintaining and improving open-source software.