unstructured icon indicating copy to clipboard operation
unstructured copied to clipboard

Quickfix for elements sharing the same memory address

Open micmarty-deepsense opened this issue 9 months ago • 2 comments

This PR attempts to fix a memory issue, which resulted in errors like this: https://github.com/Unstructured-IO/unstructured/issues/2931 The root cause seems to be in how ListItems are being combined, not in how hashes or parent IDs are updated.

When assign_and_map_hash_ids() is called and elements (or elements' metadata) do not have unique memory addresses, then updating the parent_id of one element will also overwrite the parent_id of some other element.

micmarty-deepsense avatar Apr 26 '24 14:04 micmarty-deepsense

I checked out to your branch and verified this does not raise the KeyError, but can you add a test? I suppose it could be a follow-up PR. Approving in advance

Coniferish avatar Apr 26 '24 15:04 Coniferish

https://github.com/Unstructured-IO/unstructured/actions/runs/8850164463/job/24304063524

=========================== short test summary info ============================
FAILED test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_word_bbox_not_char - assert 18 == 17
 +  where 18 = len([<unstructured.documents.elements.Header object at 0x7ff8e4d241d0>, <unstructured.documents.elements.Header object at ...NarrativeText object at 0x7ff8e4d0eb10>, <unstructured.documents.elements.NarrativeText object at 0x7ff8f8a89ad0>, ...])
= 1 failed, 2231 passed, 13 skipped, 2 deselected, 2 xfailed, 9 xpassed, 8 warnings in 1170.99s (0:19:30) =

. either this test was incorrect before, or the def _combine_list_elements change broke something. let's make this a real release (not -dev) when tests pass.

cragwolfe avatar Apr 27 '24 18:04 cragwolfe

Amazing. Thank you for the quick turnaround on this fix @micmarty-deepsense @cragwolfe! 🙇 🙏 I can confirm that unstructured 0.13.5 fixes the issue in #2931.

adieuadieu avatar Apr 29 '24 09:04 adieuadieu