Better element IDs - deterministic and document-unique hashes

Open micmarty-deepsense opened this issue 11 months ago • 1 comments

Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:

  • hash computation includes element's position in a sequence of all elements
  • there are more tests for deterministic behavior of IDs returned by partitioning functions and for their uniqueness (guaranteed at the document level, and highly probable across multiple documents)
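
As a rough illustration of the first point (a minimal sketch, not the code from this PR), the snippet below hashes an element's text together with its file name and its position in the element sequence. The helper name `position_aware_id`, the `|` separator, and the 32-character truncation are assumptions made for the example only.

```python
import hashlib


def position_aware_id(text: str, filename: str, position: int) -> str:
    """Hypothetical helper: derive an ID from text, file name and sequence position."""
    # Including the position means two elements with identical text in the
    # same document still hash to different IDs.
    payload = f"{filename}|{position}|{text}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Two elements share the same text but sit at different positions.
texts = ["Section", "Same text", "Same text"]
ids = [position_aware_id(t, "example.html", i) for i, t in enumerate(texts)]

# Recomputing from the same inputs gives the same IDs (deterministic).
ids_again = [position_aware_id(t, "example.html", i) for i, t in enumerate(texts)]
```

Because the inputs are only the file name, position, and text, re-partitioning the same document yields the same IDs, while repeated text inside one document no longer collides.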

This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461

micmarty-deepsense avatar Mar 20 '24 16:03 micmarty-deepsense

@scanny I've addressed all of your comments. If it looks good to you now, let's merge it today, right after this: https://github.com/Unstructured-IO/unstructured/pull/2842

micmarty-deepsense avatar Apr 09 '24 12:04 micmarty-deepsense

Let's think again about this in light of parent as object reference rather than parent-id and not merge yet.

I'd like to tackle this in a separate PR, there're too many changes in this PR already I think 🦖

micmarty-deepsense avatar Apr 18 '24 18:04 micmarty-deepsense

@micmarty-deepsense

I'd like to tackle this in a separate PR, there're too many changes in this PR already I think 🦖

I agree. This PR did a lot for us, but now all that discussion is resolved I think we're far better off just letting this one go and starting a fresh one :)

scanny avatar Apr 18 '24 19:04 scanny

@scanny It looks like I need to update all the element_id values for the spring-weather.html.json file, am I right? Here's the job that fails: https://github.com/Unstructured-IO/unstructured/actions/runs/8742887283/job/23996742775

micmarty-deepsense avatar Apr 18 '24 21:04 micmarty-deepsense

@scanny It looks like I need to update all the element_id values for the spring-weather.html.json file, am I right? Here's the job that fails: https://github.com/Unstructured-IO/unstructured/actions/runs/8742887283/job/23996742775

yep.

cragwolfe avatar Apr 18 '24 21:04 cragwolfe

@micmarty-deepsense

@scanny It looks like I need to update all the element_id values for the spring-weather.html.json file, am I right?

I'd be inclined to solve this by:

  • getting the test running locally
  • commenting out most of it, leaving just the part that does elements = elements_from_json(filename)
  • calling the equivalent of e.id_to_hash() for e in elements
  • writing that out to a new file with something like elements_to_json(new_filename)

and see if that does the bulk of the work for you :)
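
The steps above could look roughly like the sketch below. It is a standard-library-only stand-in: in the real repo the loading and saving would go through elements_from_json and elements_to_json, and the ID would come from the library's own hashing rather than the made-up payload used here, so this will not reproduce the actual expected IDs. The name `rehash_fixture` and the payload format are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def rehash_fixture(src: Path, dst: Path) -> int:
    """Hypothetical helper: rewrite every element_id in a JSON fixture."""
    elements = json.loads(src.read_text(encoding="utf-8"))
    for position, element in enumerate(elements):
        filename = (element.get("metadata") or {}).get("filename", "")
        # Stand-in hash, NOT the library's real scheme.
        payload = f"{filename}|{position}|{element.get('text', '')}"
        element["element_id"] = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]
    dst.write_text(json.dumps(elements, indent=2), encoding="utf-8")
    return len(elements)


# Tiny demo on a throwaway fixture with stale IDs.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "old.json"
    dst = Path(tmp) / "new.json"
    stale = [
        {"element_id": "stale-1", "text": "Hello", "metadata": {"filename": "a.html"}},
        {"element_id": "stale-2", "text": "Hello", "metadata": {"filename": "a.html"}},
    ]
    src.write_text(json.dumps(stale), encoding="utf-8")
    count = rehash_fixture(src, dst)
    regenerated = json.loads(dst.read_text(encoding="utf-8"))
```

Writing to a new file rather than overwriting the original keeps the old fixture around for a diff before committing.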

scanny avatar Apr 19 '24 05:04 scanny

@scanny It looks like I need to update all the element_id values for the spring-weather.html.json file, am I right? Here's the job that fails: https://github.com/Unstructured-IO/unstructured/actions/runs/8742887283/job/23996742775

yep.

actually, wait. this file is in test_unstructured_ingest/expected-structured-output, where the element_ids and all the JSON should be updated by the GitHub Action https://github.com/Unstructured-IO/unstructured/actions/workflows/ingest-test-fixtures-update-pr.yml (you need to click the Run Workflow button in the top right and select your branch).

cragwolfe avatar Apr 19 '24 05:04 cragwolfe