feat/group elements by parent_id

Open ron-unstructured opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe.
Following up on the document hierarchy implementation, it would be helpful to have a built-in function to group elements with the same parent_id.

Describe the solution you'd like
A function similar to chunk_by_title, except that the parent element is not always a Title.

Describe alternatives you've considered
Group elements with the same parent_id and assign the previous element as the parent where parent_id is None.

Additional context
n/a
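
For illustration, the alternative described above could be sketched roughly as follows. This is only a minimal sketch, not an existing library function: `Elem` is a hypothetical stand-in for a partitioned element (in the library the parent would be read from the element's metadata), and `group_by_parent_id` is a hypothetical name. It assumes elements arrive in document order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Elem:
    """Hypothetical stand-in for a partitioned element."""
    id: str
    text: str
    parent_id: Optional[str] = None


def group_by_parent_id(elements: List[Elem]) -> Dict[Optional[str], List[Elem]]:
    """Group elements by parent_id, preserving document order.

    Where parent_id is None, the previous element is treated as the parent.
    The very first element has no predecessor, so it lands under the key None.
    """
    groups: Dict[Optional[str], List[Elem]] = {}
    previous_id: Optional[str] = None
    for el in elements:
        # Fall back to the preceding element's id when no parent is recorded.
        key = el.parent_id if el.parent_id is not None else previous_id
        groups.setdefault(key, []).append(el)
        previous_id = el.id
    return groups


elements = [
    Elem("t1", "Introduction"),
    Elem("p1", "First paragraph.", parent_id="t1"),
    Elem("p2", "Second paragraph.", parent_id="t1"),
    Elem("x1", "A list header without a parent."),
    Elem("p3", "A list item.", parent_id="x1"),
]
groups = group_by_parent_id(elements)
```

Here the two paragraphs end up grouped under "t1", and "x1", which has no parent_id, is filed under the element just before it ("p2").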

ron-unstructured avatar Sep 21 '23 15:09 ron-unstructured

Is there any news on this, or a workaround to group chunks by parent_id?

dantes-ai avatar Oct 29 '23 14:10 dantes-ai

The current chunk_by_title function does not retain parent-child relationships. Often a parent is grouped into the previous chunk even though it is not a child of anything in that chunk. I would like to see a new method that respects the parent/child relationship.

qy2144 avatar Feb 15 '24 00:02 qy2144

I am also looking for such a function.

weissenbacherpwc avatar May 14 '24 05:05 weissenbacherpwc

If anyone is interested in picking this up as a first issue, I think it would make sense in unstructured/utils.py

MthwRobinson avatar May 15 '24 13:05 MthwRobinson

> If anyone is interested in picking this up as a first issue, I think it would make sense in unstructured/utils.py

Hi @MthwRobinson, thanks for the information. Do you mean something like this?

def is_parent_box(parent_target: Box, child_target: Box, add: float = 0.0) -> bool:
    '''True if the child_target bounding box is nested in the parent_target.

    Box format: [x_bottom_left, y_bottom_left, x_top_right, y_top_right].
    The parameter 'add' is the pixel error tolerance for extra pixels outside the parent region.
    '''
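
For readers following along, here is a minimal sketch of the containment check that the docstring above describes. It is an assumption about the behavior, not the library's code: `Box` is assumed to be a tuple of four floats in the stated order, and `is_parent_box_sketch` is a hypothetical name chosen to avoid confusion with any real function.

```python
from typing import Tuple

# Assumed box layout: (x_bottom_left, y_bottom_left, x_top_right, y_top_right)
Box = Tuple[float, float, float, float]


def is_parent_box_sketch(parent: Box, child: Box, add: float = 0.0) -> bool:
    """Return True if `child` lies inside `parent`, allowing `add` pixels of overhang."""
    px1, py1, px2, py2 = parent
    cx1, cy1, cx2, cy2 = child
    # Each child edge may extend at most `add` pixels beyond the matching parent edge.
    return (
        cx1 >= px1 - add
        and cy1 >= py1 - add
        and cx2 <= px2 + add
        and cy2 <= py2 + add
    )


inside = is_parent_box_sketch((0, 0, 10, 10), (1, 1, 9, 9))
overhang = is_parent_box_sketch((0, 0, 10, 10), (-1, 1, 9, 9))
tolerated = is_parent_box_sketch((0, 0, 10, 10), (-1, 1, 9, 9), add=1.0)
```

Note that this checks geometric nesting of bounding boxes, which is a different signal from grouping by parent_id; whether it fits the request in this issue is an open question.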

huangpan2507 avatar Aug 13 '24 02:08 huangpan2507