Better element IDs - deterministic and document-unique hashes
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842
Main changes compared to part one:
- hash computation includes element's position in a sequence of all elements
- there are more tests for deterministic behavior of the IDs returned by partitioning functions, plus their uniqueness (guaranteed at the document level, and highly probable across multiple documents)
This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461
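To make the idea in the list above concrete, here is a minimal sketch of a position-aware hash. It is not the code from this PR: the helper names `hashed_id` and `assign_ids`, the `index:text` input layout, and the 32-character truncation are all assumptions chosen for illustration.

```python
import hashlib


def hashed_id(text: str, index_in_sequence: int) -> str:
    """Hypothetical helper: hash an element's text together with its position.

    The index goes first, followed by a separator, so that ("a1", 0) and
    ("a", 10) cannot produce the same hash input.
    """
    data = f"{index_in_sequence}:{text}"
    # Truncating to 32 hex characters is an assumption made for this sketch.
    return hashlib.sha256(data.encode("utf-8")).hexdigest()[:32]


def assign_ids(texts: list[str]) -> list[str]:
    """Assign one ID per element, in document order."""
    return [hashed_id(text, index) for index, text in enumerate(texts)]


# Two elements with identical text still get different IDs,
# because their positions in the sequence differ.
texts = ["Spring weather", "Rain is expected.", "Rain is expected."]
ids = assign_ids(texts)
```

Because the input depends only on the text and the position, running the same document twice yields the same IDs, and repeated text within one document no longer collides (barring a hash collision).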
@scanny I've addressed all of your comments. If it looks good to you now, let's merge it today, right after this: https://github.com/Unstructured-IO/unstructured/pull/2842
Let's think again about this in light of parent as object reference rather than parent-id and not merge yet.
I'd like to tackle this in a separate PR; there are too many changes in this PR already, I think 🦖
@micmarty-deepsense
I agree. This PR did a lot for us, but now that all that discussion is resolved, I think we're far better off just letting this one go and starting a fresh one :)
@scanny
It looks like I need to update all the element_id values for the spring-weather.html.json file, am I right?
Here's the job that fails: https://github.com/Unstructured-IO/unstructured/actions/runs/8742887283/job/23996742775
yep.
@micmarty-deepsense
I'd be inclined to solve this by:
- getting the test running locally
- commenting out most of it, just leaving a part that does elements = elements_from_json(filename)
- calling the equivalent of e.id_to_hash() for e in elements
- writing that out to a new file with something like elements_to_json(new_filename)
and seeing if that does the bulk of the work for you :)
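The steps above could look roughly like the sketch below. To stay runnable without the library, it works on the serialized JSON directly rather than calling elements_from_json, id_to_hash, or elements_to_json; `stand_in_hash` is a placeholder and not the library's real hashing, and the `metadata.parent_id` remapping assumes the fixture stores parent references under that key.

```python
import hashlib
import json
from pathlib import Path


def stand_in_hash(text: str, index: int) -> str:
    """Placeholder for the library's ID hashing; NOT the actual algorithm."""
    return hashlib.sha256(f"{index}:{text}".encode("utf-8")).hexdigest()[:32]


def regenerate_ids(src: Path, dst: Path) -> list[dict]:
    """Read serialized elements, recompute every element_id, write a new file."""
    elements = json.loads(src.read_text(encoding="utf-8"))

    # First pass: compute new IDs and remember which old ID each one replaces.
    old_to_new: dict[str, str] = {}
    for index, element in enumerate(elements):
        new_id = stand_in_hash(element.get("text", ""), index)
        old_id = element.get("element_id")
        if old_id is not None:
            old_to_new[old_id] = new_id
        element["element_id"] = new_id

    # Second pass: keep parent references (assumed to live in
    # metadata.parent_id) pointing at the regenerated IDs.
    for element in elements:
        metadata = element.get("metadata", {})
        parent = metadata.get("parent_id")
        if parent is not None and parent in old_to_new:
            metadata["parent_id"] = old_to_new[parent]

    dst.write_text(json.dumps(elements, indent=2), encoding="utf-8")
    return elements
```

Writing to a new file rather than overwriting the fixture makes it easy to diff the result against the original before committing.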
Actually, wait. This file is in test_unstructured_ingest/expected-structured-output, where the element_ids and all the JSON should be updated by the GitHub Action https://github.com/Unstructured-IO/unstructured/actions/workflows/ingest-test-fixtures-update-pr.yml (you need to click the Run Workflow button in the top right and select your branch).