bug/bounding boxes using strategy="hi_res" are wrong
Describe the bug When the coordinates of elements are used to draw bounding boxes, the results differ between the default strategy and the 'hi_res' strategy.
To Reproduce
sudo apt-get install -y poppler-utils tesseract-ocr
pip install "unstructured[pdf]==0.12.5" PyMuPDF poppler-utils unstructured_inference==0.7.23
# unstructured_inference 0.7.24 has an Image.open() compatibility issue with unstructured 0.12.5, so pin 0.7.23
# Partition the PDF into chunks
import fitz
from unstructured.partition.pdf import partition_pdf
from unstructured.documents.elements import Element

# Define the inputs before they are used by partition_pdf
document = "/content/1706.03762v7.pdf"
chunk_size = 0

elements_high_res = partition_pdf(
    filename=document,
    chunk_size=chunk_size,
    extract_images_in_pdf=True,
    extract_image_block_output_dir="/content/images",
    strategy="hi_res",
    use_gpu=True,
)
elements = partition_pdf(
    filename=document,
    chunk_size=chunk_size,
)

# Using hi_res strategy
output_pdf_path = "/content/1706.03762v7_modded_high_res.pdf"
pdf_document = fitz.open(document)
for element in elements_high_res:
    if isinstance(element, Element):
        page_number = element.metadata.page_number
        coordinates = element.metadata.coordinates
        # Check for missing values before calling to_dict()
        if page_number is not None and coordinates is not None:
            bbox = coordinates.to_dict()
            top_left, bottom_right = bbox['points'][0], bbox['points'][2]
            page = pdf_document[page_number - 1]  # PyMuPDF uses 0-based indexing for pages
            rect = fitz.Rect(top_left, bottom_right)
            page.draw_rect(rect, color=(1, 0, 0), width=2)  # Draw a red rectangle with a width of 2
# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()
# Using default strategy
output_pdf_path = "/content/1706.03762v7_modded.pdf"
pdf_document = fitz.open(document)
for element in elements:
    if isinstance(element, Element):
        page_number = element.metadata.page_number
        coordinates = element.metadata.coordinates
        # Check for missing values before calling to_dict()
        if page_number is not None and coordinates is not None:
            bbox = coordinates.to_dict()
            top_left, bottom_right = bbox['points'][0], bbox['points'][2]
            page = pdf_document[page_number - 1]  # PyMuPDF uses 0-based indexing for pages
            rect = fitz.Rect(top_left, bottom_right)
            page.draw_rect(rect, color=(1, 0, 0), width=2)  # Draw a red rectangle with a width of 2
# Save the modified PDF
pdf_document.save(output_pdf_path)
pdf_document.close()
[1706.03762v7_modded_high_res.pdf](https://github.com/Unstructured-IO/unstructured/files/15441444/1706.03762v7_modded_high_res.pdf)
[1706.03762v7_modded.pdf](https://github.com/Unstructured-IO/unstructured/files/15441445/1706.03762v7_modded.pdf)
[1712.05889v2.pdf](https://github.com/Unstructured-IO/unstructured/files/15441446/1712.05889v2.pdf)
[1706.03762v7.pdf](https://github.com/Unstructured-IO/unstructured/files/15441447/1706.03762v7.pdf)
Expected behavior The bounding boxes should not change due to the strategy change
Screenshots
Screenshots are attached as PDFs (see the links above). Two inline screenshots were also posted: one for the default strategy and one for the hi_res strategy.
Environment Info
Public workbook link https://colab.research.google.com/drive/1z2dwE9t6zsgTcejx9RQzj_nTDHOdS4Vv?usp=sharing
Additional context None
@leah1985 - Does this seem like an issue with the model output or a pre/post-processing issue?
@MthwRobinson I think this is not a "hi_res" strategy issue but a "fast" strategy issue due to CoordinateSystem. I'll take a closer look at this issue.
Sounds good - thanks!
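One way to check the CoordinateSystem hypothesis above is to look at the reference frame each strategy reports next to its points. The following is a minimal sketch, assuming that element.metadata.coordinates.to_dict() returns the keys system, layout_width and layout_height alongside points (worth verifying on the installed version). The helpers summarize_frame and same_frame are hypothetical, and the sample dictionaries hold illustrative values, not output from the attached PDFs.

```python
def summarize_frame(coord_dict):
    """Return (system, layout_width, layout_height) from a coordinates dict."""
    return (
        coord_dict.get("system"),
        coord_dict.get("layout_width"),
        coord_dict.get("layout_height"),
    )


def same_frame(a, b):
    """True when two coordinates dicts describe the same reference frame."""
    return summarize_frame(a) == summarize_frame(b)


# Illustrative values only (assumption): a US Letter page measured in PDF
# points versus the same page rendered to an image at 200 dpi.
fast_like = {"points": [], "system": "PixelSpace",
             "layout_width": 612, "layout_height": 792}
hi_res_like = {"points": [], "system": "PixelSpace",
               "layout_width": 1700, "layout_height": 2200}

print(summarize_frame(fast_like))
print(summarize_frame(hi_res_like))
print(same_frame(fast_like, hi_res_like))
```

If the two frames differ for the same page, the points cannot be drawn on one canvas without first converting them to a common frame.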
From my own experience, hi_res uses the coordinates of the image produced by converting the PDF page, which is a step the fast strategy does not need. The rendered image has a much higher pixel density than the page's own dimensions, so the coordinates fall outside the page for a document loaded with fitz (PyMuPDF). Please load the image with from unstructured_inference.inference.layout import convert_pdf_to_image to get the matching format and coordinate system.
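If the goal is to keep drawing on the original PDF rather than on the rendered image, another option is to rescale the points from the reported layout size to the page size. This is only a sketch under assumptions: rescale_bbox is a hypothetical helper, the layout size is taken from the layout_width and layout_height values assumed to be present in the coordinates metadata, and the numbers below are illustrative (a US Letter page rendered at 200 dpi).

```python
def rescale_bbox(points, layout_width, layout_height, page_width, page_height):
    """Map a box from the layout frame to the page frame.

    points: corner points as (x, y) pairs; the first and third are taken as
    top-left and bottom-right, as in the reproduction script.
    Returns (x0, y0, x1, y1) in page units.
    """
    sx = page_width / layout_width
    sy = page_height / layout_height
    (x0, y0), (x1, y1) = points[0], points[2]
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


# Illustrative numbers (assumption): a 1700 x 2200 pixel rendering of a
# 612 x 792 point page.
box = rescale_bbox(
    points=[(200, 400), (200, 800), (1000, 800), (1000, 400)],
    layout_width=1700, layout_height=2200,
    page_width=612, page_height=792,
)
print(box)
```

With PyMuPDF, the page size is available as page.rect.width and page.rect.height, and the returned tuple can be passed to fitz.Rect(*box) before calling page.draw_rect.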
@Baukebrenninkmeijer Hi, thank you for the comment. Have you figured out a way/workaround to resolve this? Thanks
@AndrewTsai0406 The thing I suggested, loading the image through that function, was my workaround. I'm not currently using it, so I cannot help more, I'm afraid.
Sorry, we won't be fixing it, at least for now. You should use hi_res if you want to get accurate bboxes.
Regarding the workaround quoted from the earlier comment ("Please load the image with from unstructured_inference.inference.layout import convert_pdf_to_image to get the right format and coordinate system"): I can confirm that this is working.
We are experiencing the same problem with the HI_RES strategy!
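Code:
# Setup assumed for completeness (not shown in the original post)
from unstructured_client import UnstructuredClient
from unstructured_client.models import operations, shared

client = UnstructuredClient(api_key_auth="YOUR_API_KEY")  # placeholder key

filename = "example.pdf"
with open(filename, "rb") as f:
    data = f.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename,
        ),
        strategy=shared.Strategy.HI_RES,
        coordinates=True,
        languages=["de"],
    ),
)
try:
    res = client.general.partition(request=req)
    print(res.elements[0])
except Exception as e:
    print(e)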
This leads to incorrect bounding boxes, similar to @mandar-karhade's initial post.
When I use the default strategy (omitting the strategy parameter, as in the following code) or set strategy=shared.Strategy.FAST, I no longer get any bounding boxes, even with coordinates=True:
req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename,
        ),
        coordinates=True,
        languages=["de"],
    ),
)
What is the suggested way to retrieve correct bounding boxes with any of the strategies?
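The thread does not give an official answer. One approach that does not depend on the strategy is to express each box as a fraction of the frame it was reported in, then scale it to whatever canvas is used for drawing. The sketch below assumes each returned element is a dict whose metadata.coordinates entry carries points, layout_width and layout_height. The helper relative_bbox is hypothetical and the sample element is illustrative, not real API output.

```python
def relative_bbox(element):
    """Return (x0, y0, x1, y1) as fractions of the layout size, or None."""
    coords = (element.get("metadata") or {}).get("coordinates")
    if not coords or not coords.get("points"):
        return None  # the element carries no bounding box
    width, height = coords["layout_width"], coords["layout_height"]
    xs = [p[0] for p in coords["points"]]
    ys = [p[1] for p in coords["points"]]
    return (min(xs) / width, min(ys) / height,
            max(xs) / width, max(ys) / height)


# Illustrative element (assumption), shaped like a partition response item.
sample = {
    "type": "Title",
    "text": "Example",
    "metadata": {
        "page_number": 1,
        "coordinates": {
            "points": [[170, 220], [170, 440], [850, 440], [850, 220]],
            "system": "PixelSpace",
            "layout_width": 1700,
            "layout_height": 2200,
        },
    },
}
print(relative_bbox(sample))
print(relative_bbox({"type": "Title", "text": "No box", "metadata": {}}))
```

Multiplying the fractions by the width and height of the target page or image gives coordinates in that canvas. Elements without coordinates return None, so they can be skipped instead of raising an error.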