
Suggestion: include consolidated bounding box coordinates in chunk metadata when using "by_title" chunking strategy

Open m-kemarskyi opened this issue 1 year ago • 3 comments

Problem: Currently, when the "by_title" chunking strategy is used with the coordinates = true parameter (to return the coordinates of the PDF chunks), no coordinates are returned. This is because the strategy joins separate chunks under the hood, and the joined chunk may span multiple pages.

The "by_title" strategy is really useful because the "default" strategy often returns very small chunks (a single word or a couple of words). The inability to use coordinates with the "by_title" strategy therefore blocks use cases that require the coordinates of text blocks in PDF files.

Suggestion: Return consolidated bounding box coordinates when the "by_title" chunking strategy is used, i.e. a rectangle spanning the extreme coordinates of the included chunks. This would apply when the multipage_sections = False parameter is passed, since chunks then cannot span multiple pages and the Unstructured.io API can calculate the bounding box on a single page.
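To make the proposal concrete, here is a minimal sketch of how such a consolidated rectangle could be computed. It is not the library's implementation: the function name `consolidate_bbox` is hypothetical, and it assumes each included element's coordinates are available as a sequence of (x, y) corner points (or None when missing), all on the same page and in the same coordinate system.

```python
from typing import Iterable, Optional, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[Point, Point, Point, Point]


def consolidate_bbox(boxes: Iterable[Optional[Sequence[Point]]]) -> Optional[Box]:
    """Return the smallest axis-aligned rectangle enclosing all given boxes.

    Hypothetical helper: assumes every box is a sequence of (x, y) corner
    points from the same page and the same coordinate system.
    """
    xs = []
    ys = []
    for points in boxes:
        if not points:
            # Skip elements that carry no coordinates.
            continue
        for x, y in points:
            xs.append(x)
            ys.append(y)

    if not xs:
        # None of the elements had coordinates.
        return None

    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Corner order (an assumption): (x_min, y_min), (x_min, y_max),
    # (x_max, y_max), (x_max, y_min).
    return ((x_min, y_min), (x_min, y_max), (x_max, y_max), (x_max, y_min))


# Two elements that would be merged into one chunk on a single page.
title_box = [(10, 20), (10, 40), (100, 40), (100, 20)]
body_box = [(15, 45), (15, 80), (120, 80), (120, 45)]
merged = consolidate_bbox([title_box, None, body_box])
print(merged)  # ((10, 20), (10, 80), (120, 80), (120, 20))
```

Taking the minimum and maximum over all corner points keeps the sketch independent of the corner ordering used by the input boxes. It is only meaningful when every element lies on one page, which is why the suggestion ties it to multipage_sections = False.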

Additional context: The issue was discussed here: https://github.com/Unstructured-IO/unstructured/issues/1698

m-kemarskyi avatar Jun 11 '24 15:06 m-kemarskyi

As @scanny suggested, a "by_page" strategy could also be added (as far as I understand, it would be equivalent to "by_title" with multipage_sections = False).

m-kemarskyi avatar Jun 11 '24 15:06 m-kemarskyi

@awalker4 Also, it turns out that the multipage_sections parameter is not working at the moment (tried on the latest API version, 0.0.72, with chunking_strategy = by_title).

m-kemarskyi avatar Jul 08 '24 08:07 m-kemarskyi

Hi all, is there any update on this suggestion yet? It seems that when I don't use chunking_strategy=by_title, each returned element corresponds to one line of the document and includes its coordinates, but when I use chunking, all the coordinates are returned as None. I'm using the pip package v0.1.6.

thanh-px avatar Jan 09 '25 07:01 thanh-px