unstructured
Enhancement: better element IDs
Is your feature request related to a problem? Please describe.
Currently, each element_id is simply a hash of the element's text. This is a problem because two elements with identical text get the same ID, so IDs may be duplicated within a page or document.
Proposal
Deterministic element IDs should be a hash of (text, page_num, seq_no_in_page). Each element_id would then be unique (with extremely high probability) within a document. If pages are processed in parallel, the resulting element_id values should match what serial processing (the current behavior) would produce.
This implies that metadata_page_number_begin must also be an optional parameter for partition() and for the API.
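To make the proposal concrete, here is a minimal sketch of the hashing scheme, not the library's actual implementation. The names `deterministic_element_id` and `assign_ids` are hypothetical, `page_number_begin` stands in for the proposed `metadata_page_number_begin` parameter, and the choice of SHA-256 truncated to 32 hex characters is an assumption.

```python
import hashlib


def deterministic_element_id(text: str, page_number: int, seq_in_page: int) -> str:
    """Hash (text, page_number, seq_in_page) into a hex ID (hypothetical helper)."""
    # Length-prefix the text so the field boundaries are unambiguous.
    payload = f"{page_number}:{seq_in_page}:{len(text)}:{text}".encode("utf-8")
    # Truncating to 32 hex characters is an assumption, not a requirement.
    return hashlib.sha256(payload).hexdigest()[:32]


def assign_ids(pages, page_number_begin=1):
    """Assign IDs to a list of pages, where each page is a list of element texts."""
    ids = []
    for offset, texts in enumerate(pages):
        page_number = page_number_begin + offset
        for seq, text in enumerate(texts):
            ids.append(deterministic_element_id(text, page_number, seq))
    return ids


# Three elements on page 1 (two with identical text) and one on page 2.
pages = [["Title", "Same line", "Same line"], ["Same line"]]

# Serial processing of the whole document.
serial_ids = assign_ids(pages)

# A parallel worker that only sees page 2 needs to know where it starts.
second_page_ids = assign_ids(pages[1:], page_number_begin=2)
```

Because each ID depends only on the element's text, page number, and position within its page, a worker handling a slice of the document produces the same IDs as a serial run, provided it is told the starting page number. Elements with identical text no longer collide, whether they share a page or not.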
Other considerations
Hashing with other metadata is potentially fair game, to attempt to keep IDs distinct between documents. Determinism is a must, however.
Initially, this implementation would not affect the partition parameter unique_element_ids=True.
I have an issue with this as well: parent IDs are set incorrectly because of it.