unstructured
bug/`_partition_pdf_or_image_local` expects a PDF file input
Describe the bug
This is not breaking anything noticeable so far, but the method name is confusing: `_partition_pdf_or_image_local` (imported with `from unstructured.partition.pdf import _partition_pdf_or_image_local`) now always expects a PDF, so calling it with an image raises `PDFSyntaxError: No /Root object! - Is this really a PDF?` from pdfminer. I assume images are handled in another method, but isn't the name `_partition_pdf_or_image_local` confusing now? In the main library the method is called from `partition_pdf_or_image`, for instance. Maybe the method just needs to be renamed?
To Reproduce
from unstructured.partition.pdf import _partition_pdf_or_image_local
model_name="chipper"
elements = _partition_pdf_or_image_local(filename='layout-parser-paper-fast.jpg', model_name=model_name)
(Could `model_name="chipper"` be related to the bug?)
Environment Info
Run in a Colab notebook with unstructured 0.10.9.
Actually, you can pass in an image, but you need to specify `is_image=True`, like this:
elements = _partition_pdf_or_image_local(filename=f, model_name=model_name, is_image=True)
Now the question is: should the function automatically determine whether an input is an image? The name definitely makes it sound like it can...
Yes, I think that makes sense. Checking whether the input is an image is something we could add to the method while still keeping the `is_image` param, so we avoid refactoring other calls and tests that potentially use it.
(This function is not intended to be used by users and is not breaking anything anyway.)
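For illustration, here is a rough sketch of what that auto-detection could look like. This is not code from unstructured: `looks_like_image` and `resolve_is_image` are hypothetical helpers, and treating `is_image=None` as "not specified" is an assumption made so that existing callers passing `is_image` explicitly keep their behavior. The signature list covers only a few common formats.

```python
from typing import Optional

# Magic-byte prefixes for a few common image formats (not exhaustive).
_IMAGE_SIGNATURES = (
    b"\xff\xd8\xff",       # JPEG
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"GIF87a",             # GIF (1987 spec)
    b"GIF89a",             # GIF (1989 spec)
    b"BM",                 # BMP
    b"II*\x00",            # TIFF, little-endian
    b"MM\x00*",            # TIFF, big-endian
)


def looks_like_image(filename: str) -> bool:
    """Hypothetical helper: sniff the first bytes of the file for an image signature."""
    with open(filename, "rb") as f:
        header = f.read(8)
    # bytes.startswith accepts a tuple of candidate prefixes.
    return header.startswith(_IMAGE_SIGNATURES)


def resolve_is_image(filename: str, is_image: Optional[bool] = None) -> bool:
    """Hypothetical helper: an explicit is_image wins; otherwise detect from the file."""
    if is_image is not None:
        return is_image
    return looks_like_image(filename)
```

With something like this, the function could call `resolve_is_image(filename, is_image)` at the top and branch on the result, while calls that already pass `is_image=True` or `is_image=False` would be unaffected.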
Status update: is this being worked on in the backlog? If it is not planned, it will be marked as complete in 2 weeks.
Closing since this is a non-public function.