bug/bounding boxes using strategy="hi_res" are wrong

Open mandar-karhade opened this issue 1 year ago • 6 comments

Describe the bug When element coordinates are used to draw bounding boxes, the coordinates differ between the default strategy and the 'hi_res' strategy.

To Reproduce

sudo apt-get install -y poppler-utils  tesseract-ocr
pip install "unstructured[pdf]==0.12.5" PyMuPDF poppler-utils unstructured_inference==0.7.23 
# unstructured_inference 0.7.24 has an Image.open() compatibility issue with unstructured 0.12.5, so downgrade to 0.7.23

# Partition the PDF into chunks
import fitz
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Element

# Define the inputs before they are used by partition_pdf
document = "/content/1706.03762v7.pdf"
chunk_size = 0

elements_high_res = partition_pdf(
                        filename=document, 
                        chunk_size=chunk_size, 
                        extract_images_in_pdf=True,
                        extract_image_block_output_dir="/content/images",
                        strategy = "hi_res",
                        use_gpu=True
                         )

elements = partition_pdf(
                        filename=document, 
                        chunk_size=chunk_size
                         )

# Using hi_res strategy
output_pdf_path = "/content/1706.03762v7_modded_high_res.pdf"
pdf_document = fitz.open(document)

for element in elements_high_res:
    if isinstance(element, Element):
        page_number = element.metadata.page_number
        coordinates = element.metadata.coordinates
        # Check for missing metadata before reading the points
        if page_number is not None and coordinates is not None:
            bbox = coordinates.to_dict()
            top_left, bottom_right = bbox['points'][0], bbox['points'][2]
            page = pdf_document[page_number - 1]  # PyMuPDF uses 0-based indexing for pages
            rect = fitz.Rect(top_left, bottom_right)
            page.draw_rect(rect, color=(1, 0, 0), width=2)  # Draw a red rectangle with a width of 2

# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()

# Using default strategy
output_pdf_path = "/content/1706.03762v7_modded.pdf"
chunk_size = 0 
pdf_document = fitz.open(document)

for element in elements:
    if isinstance(element, Element):
        page_number = element.metadata.page_number
        coordinates = element.metadata.coordinates
        # Check for missing metadata before reading the points
        if page_number is not None and coordinates is not None:
            bbox = coordinates.to_dict()
            top_left, bottom_right = bbox['points'][0], bbox['points'][2]
            page = pdf_document[page_number - 1]  # PyMuPDF uses 0-based indexing for pages
            rect = fitz.Rect(top_left, bottom_right)
            page.draw_rect(rect, color=(1, 0, 0), width=2)  # Draw a red rectangle with a width of 2

# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()
[1706.03762v7_modded_high_res.pdf](https://github.com/Unstructured-IO/unstructured/files/15441444/1706.03762v7_modded_high_res.pdf)
[1706.03762v7_modded.pdf](https://github.com/Unstructured-IO/unstructured/files/15441445/1706.03762v7_modded.pdf)
[1712.05889v2.pdf](https://github.com/Unstructured-IO/unstructured/files/15441446/1712.05889v2.pdf)
[1706.03762v7.pdf](https://github.com/Unstructured-IO/unstructured/files/15441447/1706.03762v7.pdf)

Expected behavior The bounding boxes should not change due to the strategy change

Screenshots Screenshots are attached as PDFs, but here are screenshots as well. Default strategy: [image: default_strategy]. hi_res strategy: [image: hi_res_strategy].

Environment Info Please run python scripts/collect_env.py and paste the output here. This will help us understand more about the environment in which the bug occurred. Public workbook link https://colab.research.google.com/drive/1z2dwE9t6zsgTcejx9RQzj_nTDHOdS4Vv?usp=sharing

Additional context None

mandar-karhade avatar May 25 '24 03:05 mandar-karhade

@leah1985 - Does this seem like an issue with the model output or a pre/post-processing issue?

MthwRobinson avatar May 28 '24 12:05 MthwRobinson

@MthwRobinson I think this is not a "hi_res" strategy issue but a "fast" strategy issue due to CoordinateSystem. I'll take a closer look at this issue.

christinestraub avatar Jun 24 '24 22:06 christinestraub

Sounds good - thanks!

MthwRobinson avatar Jun 25 '24 11:06 MthwRobinson

From my own experience, hi_res uses the coordinates of the image produced by converting the PDF page, which is a step the fast method does not need. The pixel density of that rendered image is much higher than the PDF's own page size, so the coordinates fall outside the page for a document loaded with fitz (PyMuPDF). Please load the image with from unstructured_inference.inference.layout import convert_pdf_to_image to get the right format and coordinate system.

Baukebrenninkmeijer avatar Jul 24 '24 14:07 Baukebrenninkmeijer
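
An alternative to rendering the page with convert_pdf_to_image is to rescale the hi_res coordinates back into the page's own units before drawing. The following is only a minimal sketch of that idea, not the library's method. It assumes the dict returned by element.metadata.coordinates.to_dict() carries layout_width and layout_height next to points, and that both coordinate systems put the origin at the top-left corner. The helper name scale_bbox_to_page is hypothetical.

```python
from typing import Sequence, Tuple


def scale_bbox_to_page(
    points: Sequence[Sequence[float]],
    layout_width: float,
    layout_height: float,
    page_width: float,
    page_height: float,
) -> Tuple[float, float, float, float]:
    """Map corner points from the layout (image) space to page units.

    Hypothetical helper. Assumes both spaces have a top-left origin,
    so a plain per-axis scale is enough.
    """
    scale_x = page_width / layout_width
    scale_y = page_height / layout_height
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Return (x0, y0, x1, y1) in page units
    return (
        min(xs) * scale_x,
        min(ys) * scale_y,
        max(xs) * scale_x,
        max(ys) * scale_y,
    )


# Example: a US Letter page (612 x 792 pt) rendered as a 1700 x 2200 px image
corners = [(170, 220), (170, 440), (850, 440), (850, 220)]
scaled = scale_bbox_to_page(corners, 1700, 2200, 612, 792)
print(scaled)
```

With PyMuPDF, page.rect.width and page.rect.height give the page size, and fitz.Rect(*scaled) builds the rectangle to pass to page.draw_rect. If the source coordinate system has a bottom-left origin, the y values would also need flipping, which this sketch does not do.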

@Baukebrenninkmeijer Hi, thank you for the comment. Have you figured out a way/workaround to resolve this? Thanks

AndrewTsai0406 avatar Sep 18 '24 06:09 AndrewTsai0406

@AndrewTsai0406 the thing I suggested, loading the image through that function, was my workaround. I'm not currently using it, so cannot help more i'm afraid.

Baukebrenninkmeijer avatar Sep 18 '24 08:09 Baukebrenninkmeijer

Sorry, we won't be fixing it, at least for now. You should use hi_res if you want to get accurate bboxes.

hubert-rutkowski85 avatar Dec 17 '24 17:12 hubert-rutkowski85

From my own experience, hi_res uses the coordinates of the image produced by converting the PDF page, which is a step the fast method does not need. The pixel density of that rendered image is much higher than the PDF's own page size, so the coordinates fall outside the page for a document loaded with fitz (PyMuPDF). Please load the image with from unstructured_inference.inference.layout import convert_pdf_to_image to get the right format and coordinate system.

can confirm that this is working

EriSetyawan166 avatar Jan 11 '25 07:01 EriSetyawan166

We are experiencing the same problem with the HI_RES strategy!

Code:

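from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

# Placeholder credentials: replace with your own API key
client = UnstructuredClient(api_key_auth="YOUR_API_KEY")

filename = "example.pdf"
with open(filename, "rb") as f:
    data = f.read()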

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename,
        ),
        strategy=shared.Strategy.HI_RES,  
        coordinates = True,
        languages=['de'],
    ),
)

try:
    res = client.general.partition(request=req)
    print(res.elements[0])
except Exception as e:
    print(e)

This leads to incorrect bounding boxes, similar to @mandar-karhade's initial post.

When I use the default strategy with the following code (omitting the strategy parameter), or set strategy=shared.Strategy.FAST, I no longer get any bounding boxes, even with coordinates=True.

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename,
        ),
        coordinates = True,
        languages=['de'],
    ),
)

What is the suggested way to retrieve correct bounding boxes with any of the strategies?

charlottecrnj avatar Jan 16 '25 13:01 charlottecrnj
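
One way to make boxes comparable across strategies is to express them as fractions of the layout size, so they no longer depend on the resolution the page was rendered at. This is a hedged sketch, not an official recommendation from the maintainers. It assumes each API element is a dict whose metadata.coordinates entry holds points, layout_width and layout_height, and it returns None when coordinates are absent, as reported above for the FAST strategy. The helper name relative_bbox is hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


def relative_bbox(
    element: Dict[str, Any],
) -> Optional[Tuple[float, float, float, float]]:
    """Return (x0, y0, x1, y1) as fractions of the layout size, or None.

    Hypothetical helper. The dict shape used here is an assumption
    about the API response, not a documented guarantee.
    """
    coords = (element.get("metadata") or {}).get("coordinates") or {}
    points = coords.get("points")
    width = coords.get("layout_width")
    height = coords.get("layout_height")
    # Missing coordinates or layout size: nothing to normalize
    if not points or not width or not height:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs) / width, min(ys) / height, max(xs) / width, max(ys) / height)


# Made-up element for illustration only
sample = {
    "type": "Title",
    "metadata": {
        "coordinates": {
            "points": [[170, 220], [170, 440], [850, 440], [850, 220]],
            "layout_width": 1700,
            "layout_height": 2200,
        }
    },
}
box = relative_bbox(sample)
missing = relative_bbox({"type": "Title", "metadata": {}})
print(box, missing)
```

Multiplying the fractions by the target page or image size gives drawable coordinates in that target's units.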