unstructured
unstructured copied to clipboard
Quickfix for elements sharing the same memory address
This PR attempts to fix a memory issue, which resulted in errors like this: https://github.com/Unstructured-IO/unstructured/issues/2931 The root cause seems to be in how ListItems are being combined, not in how hashes or parent IDs are updated.
When assign_and_map_hash_ids()
is called and elements (or elements' metadata) do not have unique memory addresses, then updating the parent_id of one element will also overwrite the parent_id of some other element.
I checked out to your branch and verified this does not raise the KeyError, but can you add a test? I suppose it could be a follow-up PR. Approving in advance
https://github.com/Unstructured-IO/unstructured/actions/runs/8850164463/job/24304063524
=========================== short test summary info ============================
FAILED test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_word_bbox_not_char - assert 18 == 17
+ where 18 = len([<unstructured.documents.elements.Header object at 0x7ff8e4d241d0>, <unstructured.documents.elements.Header object at ...NarrativeText object at 0x7ff8e4d0eb10>, <unstructured.documents.elements.NarrativeText object at 0x7ff8f8a89ad0>, ...])
= 1 failed, 2231 passed, 13 skipped, 2 deselected, 2 xfailed, 9 xpassed, 8 warnings in 1170.99s (0:19:30) =
. either this test was incorrect before, or the def _combine_list_elements
change broke something.
let's make this a real release (not -dev
) when tests pass.
Amazing. Thank you for the quick turnaround on this fix @micmarty-deepsense @cragwolfe! 🙇 🙏 I can confirm that unstructured 0.13.5 fixes the issue in #2931.