
Enhancement: better element IDs

Open cragwolfe opened this issue 5 months ago • 1 comment

Is your feature request related to a problem? Please describe.

Currently, each element_id is simply a hash of the element's text. This is a problem because two elements with identical text get the same ID, so IDs may be duplicated within a page or document.

Proposal

A deterministic element ID should be a hash of (text, page_num, seq_no_in_page). Element IDs would then be unique (with extremely high probability) within a document. If pages are processed in parallel, the element IDs should be the same as if the pages had been processed serially, which is how they are currently processed.
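A minimal sketch of the idea, for illustration only. The helper names `element_id` and `text_only_id`, the choice of SHA-256, and the truncation to 32 hex characters are all assumptions; the issue does not specify a hash function or an ID length.

```python
import hashlib


def element_id(text: str, page_num: int, seq_no_in_page: int) -> str:
    """Hypothetical helper: deterministic ID from text plus position."""
    # The integer fields come first and are separated by NUL bytes, so the
    # boundaries between the three fields are unambiguous.
    key = f"{page_num}\x00{seq_no_in_page}\x00{text}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def text_only_id(text: str) -> str:
    """Text-only hashing, as described in the problem statement (assumed SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


# Two elements with identical text on the same page: (text, page_num, seq_no_in_page).
first = ("Page footer", 1, 0)
second = ("Page footer", 1, 7)

assert text_only_id(first[0]) == text_only_id(second[0])  # duplicate IDs today
assert element_id(*first) != element_id(*second)          # distinct with position
assert element_id(*first) == element_id(*first)           # still deterministic
```

Because the inputs are only the text and its position, rerunning on the same document yields the same IDs.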

This implies that metadata_page_number_begin must also be an optional parameter for partition() and for the API.
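The following sketch shows why a starting page number is needed when a document is split into chunks. `ids_for_pages` is a hypothetical stand-in, not the real partition(); it only borrows the proposed parameter name `metadata_page_number_begin`. The hash function and the sample document are assumptions as well.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def element_id(text, page_num, seq_no_in_page):
    """Hypothetical helper: hash of (text, page_num, seq_no_in_page)."""
    key = f"{page_num}\x00{seq_no_in_page}\x00{text}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def ids_for_pages(pages, metadata_page_number_begin=1):
    """Hypothetical stand-in for partitioning a run of pages.

    `pages` is a list of pages, each a list of element texts.
    `metadata_page_number_begin` is the document-level number of the first page.
    """
    return [
        element_id(text, metadata_page_number_begin + offset, seq)
        for offset, page in enumerate(pages)
        for seq, text in enumerate(page)
    ]


document = [["Title", "Intro"], ["Body", "Footer"], ["Body", "Footer"], ["End"]]

# Serial processing of the whole document.
serial_ids = ids_for_pages(document)

# Two chunks processed concurrently; each is told where it starts in the document.
chunks = [(document[:2], 1), (document[2:], 3)]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda chunk: ids_for_pages(*chunk), chunks))
parallel_ids = [eid for chunk_ids in results for eid in chunk_ids]

assert parallel_ids == serial_ids               # same IDs as the serial run
assert len(set(serial_ids)) == len(serial_ids)  # no duplicates within the document
```

Without the starting page number, the second chunk would number its pages from 1 and produce IDs that differ from the serial run.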

Other considerations

Including other metadata in the hash is potentially fair game as a way to keep IDs distinct between documents. Determinism is a must, however.

Initially, this implementation would not affect the partition parameter unique_element_ids=True.

cragwolfe • Jan 26 '24 05:01