unstructured
feat/group elements by parent_id
Is your feature request related to a problem? Please describe.
Following up on the document hierarchy implementation, it would be helpful to have a built-in function to group elements with the same parent_id.
Describe the solution you'd like
Similar to chunk_by_title, but the parent type is not always a Title.
Describe alternatives you've considered
Group elements with the same parent_id and assign the previous element as the parent where parent_id is None.
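A minimal sketch of the alternative described above, assuming each element exposes an id and an optional parent_id. The Element dataclass and the group_by_parent_id name are hypothetical stand-ins for illustration, not part of the unstructured API; with real elements the parent id would presumably be read from the element's metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a document element (not the library's class)."""
    id: str
    text: str
    parent_id: Optional[str] = None


def group_by_parent_id(elements: List[Element]) -> Dict[Optional[str], List[Element]]:
    """Group elements by parent_id, preserving document order.

    Where parent_id is None, the previous element is treated as the parent.
    The first element has no predecessor, so it stays under the None key.
    """
    groups: Dict[Optional[str], List[Element]] = {}
    previous: Optional[Element] = None
    for element in elements:
        parent_id = element.parent_id
        if parent_id is None and previous is not None:
            # Fallback from the issue text: use the previous element as parent.
            parent_id = previous.id
        groups.setdefault(parent_id, []).append(element)
        previous = element
    return groups
```

Because plain dicts keep insertion order, the groups come back in the order their parents first appear in the document. Whether top-level elements such as titles should really fall back to the previous element is a design question this sketch leaves open.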
Additional context n/a
Is there any news on this, or a workaround to group chunks by parent id?
The current chunk_by_title function does not retain the parent-child relationship. Often a parent is grouped into the previous chunk even though it is not a child of that chunk. I would like to see a new method that respects the parent/child relationship.
Also looking for such a function.
If anyone is interested in picking this up as a first issue, I think it would make sense in unstructured/utils.py
Hi @MthwRobinson, thanks for the information. Do you mean this?
def is_parent_box(parent_target: Box, child_target: Box, add: float = 0.0) -> bool:
    '''True if the child_target bounding box is nested in the parent_target.

    Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right].
    The parameter 'add' is the pixel error tolerance for extra pixels outside
    the parent region.'''
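The thread shows only the signature and docstring. As a rough illustration of the check the docstring describes, here is a sketch under the assumption that a box is a sequence of four numbers in the stated order. The name box_is_nested and its body are assumptions for illustration, not the library's implementation.

```python
from typing import Sequence


def box_is_nested(
    parent: Sequence[float], child: Sequence[float], tolerance: float = 0.0
) -> bool:
    """Return True if child lies inside parent, allowing `tolerance` pixels of overhang.

    Both boxes use [x_bottom_left, y_bottom_left, x_top_right, y_top_right].
    """
    p_x1, p_y1, p_x2, p_y2 = parent
    c_x1, c_y1, c_x2, c_y2 = child
    # Every child edge must sit within the parent edge, widened by the tolerance.
    return (
        c_x1 >= p_x1 - tolerance
        and c_y1 >= p_y1 - tolerance
        and c_x2 <= p_x2 + tolerance
        and c_y2 <= p_y2 + tolerance
    )
```

A geometric check like this infers nesting from layout, whereas the request in this issue is about grouping by an already-assigned parent_id, so the two address different steps.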