
bug/`_partition_pdf_or_image_local` expects a PDF file input

Open LaverdeS opened this issue 1 year ago • 3 comments

Describe the bug

This is not breaking anything noticeable so far, but the method name is confusing. `_partition_pdf_or_image_local` (from `unstructured.partition.pdf`) now always expects a PDF, so calling it with an image raises `PDFSyntaxError: No /Root object! - Is this really a PDF?` from pdfminer. I assume images are handled in another method, but isn't the name `_partition_pdf_or_image_local` confusing now? In the main library, the method is called from `partition_pdf_or_image`, for instance. Maybe the method just needs to be renamed?

To Reproduce

from unstructured.partition.pdf import _partition_pdf_or_image_local
model_name="chipper"
elements = _partition_pdf_or_image_local(filename='layout-parser-paper-fast.jpg', model_name=model_name)

(Can model_name="chipper" here be related to the bug?)


Environment Info

Run in a Colab notebook with unstructured 0.10.9.

LaverdeS avatar Aug 30 '23 14:08 LaverdeS

Actually, you can pass in an image, but you need to specify is_image=True, like:

elements = _partition_pdf_or_image_local(filename=f, model_name=model_name, is_image=True)

Now the question is: should the function automatically determine whether an input is an image? The name definitely makes it sound like it can...

badGarnet avatar Sep 03 '23 23:09 badGarnet

Yes, I think that makes sense. Checking whether the input is an image is something we could add to the method, while still keeping the is_image param so we avoid refactoring other calls/tests that potentially use it. (This function is not intended to be used by users and is not breaking anything anyway.)

LaverdeS avatar Sep 05 '23 14:09 LaverdeS
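
The auto-detection idea discussed above could be sketched roughly as follows. This is only an illustration, not code from unstructured: `sniff_is_image` and `resolve_is_image` are hypothetical helper names, and letting `is_image` default to `None` (meaning "detect automatically") is an assumption about how the parameter could be kept for existing callers. Detection here looks at the file's leading bytes, so it does not depend on the file extension.

```python
from typing import Optional

# Leading bytes ("magic numbers") of PDF and a few common image formats.
_PDF_MAGIC = b"%PDF-"
_IMAGE_MAGICS = (
    b"\xff\xd8\xff",         # JPEG
    b"\x89PNG\r\n\x1a\n",    # PNG
    b"GIF87a", b"GIF89a",    # GIF
    b"BM",                   # BMP
    b"II*\x00", b"MM\x00*",  # TIFF (little/big endian)
)


def sniff_is_image(filename: str) -> bool:
    """Hypothetical helper: guess from the leading bytes whether a file is an image."""
    with open(filename, "rb") as f:
        header = f.read(8)
    if header.startswith(_PDF_MAGIC):
        return False
    return header.startswith(_IMAGE_MAGICS)


def resolve_is_image(filename: str, is_image: Optional[bool] = None) -> bool:
    """Hypothetical helper: an explicit is_image wins; otherwise sniff the file.

    Assumption: callers that already pass is_image=True/False keep their
    behavior, and only callers that omit it get automatic detection.
    """
    if is_image is not None:
        return is_image
    return sniff_is_image(filename)
```

A caller could then compute `is_image = resolve_is_image(filename, is_image)` once before choosing the PDF or image code path, so existing calls and tests that pass `is_image` explicitly would not need to change.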

Status update: is this being worked on in the backlog? If it is not planned for completion, it will be marked as complete in 2 weeks.

orlandounstructured avatar Feb 12 '24 19:02 orlandounstructured

Closing since this is a non-public function.

MthwRobinson avatar May 10 '24 12:05 MthwRobinson