unstructured icon indicating copy to clipboard operation
unstructured copied to clipboard

bug(docx): O(N^2) time on large section+paragraph count

Open scanny opened this issue 1 year ago • 0 comments

Summary A DOCX with a large number of sections combined with a large number of paragraphs triggers an O(N^2) process to determine the paragraphs in each section. This greatly slows partitioning of these documents.

To Reproduce The original document in this case had 470 sections and 31,723 paragraphs. Partitioning took in excess of 40 minutes.

Expected behavior Partitioning time should be O(N) with the number of paragraphs.

Diagnosis The upstream python-docx SectBlockIterator algorithm is effectively O(N^2) when called once for each section. https://github.com/python-openxml/python-docx/blob/master/src/docx/oxml/section.py#L437

It gets all the block items before the section end and then subtracts all the block items before the prior section end. This is effectively N^2 with the number of block-items (paragraphs/tables) in the document.

Work out an algorithm to get only the block items between the prior section end and the current section end. Something roughly like:

section_p = Section._sectPr
prior_section_p = section_p.xpath(./[prior-sibling]::w:sectPr)
block_item = prior_section_p
while block_item := block_item.next():
    yield block_item
    if block_item = section_p:
        break

There will be some edge-cases to work out in tests because of w:p/w:sectPr and w:sectPr diversity between body and document-end w:sectPrs.

scanny avatar Sep 03 '24 19:09 scanny