unstructured bug/opencv-python should be `headless` to avoid dependency on Xorg

Describe the bug

Getting the following error when loading PDF files in a container image that will be hosted in the cloud:

  ...
  File "/DATA/junk/test2/lib/python3.11/site-packages/unstructured/partition/auto.py", line 81, in <module>
    from unstructured.partition.pdf import partition_pdf
  File "/DATA/junk/test2/lib/python3.11/site-packages/unstructured/partition/pdf.py", line 76, in <module>
    from unstructured.partition.ocr import (
  File "/DATA/junk/test2/lib/python3.11/site-packages/unstructured/partition/ocr.py", line 6, in <module>
    import cv2
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

However, libGL.so.1 is part of the Xorg binaries. We could switch to a full Linux distro to resolve this, but a better option is to list opencv-python-headless in the dependency requirements instead of opencv-python.
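To confirm that an image is affected before installing anything, a minimal sketch like the following asks the dynamic loader for the library named in the ImportError above. The helper name `shared_library_available` is made up for illustration and is not part of unstructured or OpenCV.

```python
import ctypes


def shared_library_available(name: str) -> bool:
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False


if __name__ == "__main__":
    # libGL.so.1 is the library named in the ImportError above.
    print("libGL.so.1 loadable:", shared_library_available("libGL.so.1"))
```

If this prints False inside the container, importing the non-headless cv2 build would be expected to fail in the same way.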

Feb 04 '24 13:02 tigerinus

Having the same issue when importing partition_pdf

from unstructured.partition.pdf import partition_pdf

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
Cell In[8], line 2
      1 import os
----> 2 from unstructured.partition.pdf import partition_pdf
      3 from unstructured.staging.base import elements_to_json

File /opt/conda/lib/python3.10/site-packages/unstructured/partition/pdf.py:77
     64 from unstructured.partition.common import (
     65     convert_to_bytes,
     66     document_to_element_list,
   (...)
     71     spooled_to_bytes_io_if_needed,
     72 )
     73 from unstructured.partition.lang import (
     74     check_language_args,
     75     prepare_languages_for_tesseract,
     76 )
---> 77 from unstructured.partition.pdf_image.pdf_image_utils import (
     78     annotate_layout_elements,
     79     check_element_types_to_extract,
     80     save_elements,
     81 )
     82 from unstructured.partition.pdf_image.pdfminer_processing import (
     83     merge_inferred_with_extracted_layout,
     84 )
     85 from unstructured.partition.pdf_image.pdfminer_utils import (
     86     open_pdfminer_pages_generator,
     87     rect_to_bbox,
     88 )

File /opt/conda/lib/python3.10/site-packages/unstructured/partition/pdf_image/pdf_image_utils.py:9
      6 from pathlib import PurePath
      7 from typing import TYPE_CHECKING, BinaryIO, List, Optional, Tuple, Union, cast
----> 9 import cv2
     10 import numpy as np
     11 import pdf2image

ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Feb 13 '24 23:02 mhfarahani

Is there a workaround @tigerinus ?

Mar 10 '24 07:03 adi-kmt

@tigerinus, @mhfarahani what base image are you using? I'd like to replicate the described behavior on my side

Mar 12 '24 15:03 micmarty-deepsense

> @tigerinus, @mhfarahani what base image are you using? I'd like to replicate the described behavior on my side

Any distro that doesn't ship the required binary libGL.so.1 should reproduce this issue.

In our case, it's a highly customized embedded linux (buildroot based).

Mar 13 '24 05:03 tigerinus

As far as I can tell, the relevant dependency is layoutparser, which relies on opencv-python. There is a request similar to yours in their repo: https://github.com/Layout-Parser/layout-parser/issues/170

We have two options: a) create a PR in their package, or b) let them know that introducing the headless version in their repo is important and pressing, then wait until it's fixed there.

@tigerinus @adi-kmt @mhfarahani If you need a workaround now, I'd say you should modify your Dockerfiles in the following way:

# install unstructured library as usual

# uninstall the full version, install headless
RUN pip uninstall -y opencv-python opencv-contrib-python && pip install opencv-python-headless==4.8.0.76

If opencv-python-headless is not sufficient, try opencv-contrib-python-headless instead.

Please let me know if that helps 🤝
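To check that the swap took effect, a small sketch along these lines lists which OpenCV distributions are installed in the environment. It assumes Python 3.9+ and uses only the standard library; the function names `installed_opencv_distributions` and `non_headless` are illustrative, not part of any package.

```python
from importlib import metadata

# The OpenCV wheels published on PyPI; each one installs the same `cv2` module.
OPENCV_DISTRIBUTIONS = (
    "opencv-python",
    "opencv-contrib-python",
    "opencv-python-headless",
    "opencv-contrib-python-headless",
)


def installed_opencv_distributions() -> dict[str, str]:
    """Map each installed OpenCV distribution name to its version."""
    found = {}
    for name in OPENCV_DISTRIBUTIONS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the current environment.
            pass
    return found


def non_headless(found: dict[str, str]) -> list[str]:
    """Return the installed distributions that are not headless builds."""
    return sorted(name for name in found if not name.endswith("-headless"))


if __name__ == "__main__":
    found = installed_opencv_distributions()
    print("installed:", found)
    print("non-headless leftovers:", non_headless(found))
```

After the workaround, the expectation is that only a headless distribution is listed and the leftovers list is empty.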

Mar 13 '24 09:03 micmarty-deepsense

Facing the same problem. The workaround works, thank you!

Mar 25 '24 20:03 FilippTrigub

I've tried the workaround but now the error when importing partition_pdf is: ModuleNotFoundError: No module named 'cv2.typing'; 'cv2' is not a package
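The "'cv2' is not a package" part of that message means Python resolved cv2 to a plain module rather than a package directory, so it cannot hold a submodule such as cv2.typing. A hedged way to see what cv2 resolves to, without importing it, is sketched below; `describe_module` is a hypothetical helper, and this only diagnoses the state of the environment, it does not identify the cause.

```python
import importlib.util


def describe_module(name: str) -> dict:
    """Report where the top-level module `name` would be imported from."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"found": False, "origin": None, "is_package": False}
    return {
        "found": True,
        "origin": spec.origin,
        # Only packages have submodule search locations; a plain module
        # cannot contain submodules such as `cv2.typing`.
        "is_package": spec.submodule_search_locations is not None,
    }


if __name__ == "__main__":
    print(describe_module("cv2"))
```

If `is_package` comes back False, the `origin` path shows which file is being picked up as cv2.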

Apr 22 '24 10:04 laurazpm

Hitting the same issue. Is there any news on whether this could be changed to the headless version?

May 07 '24 05:05 Robs-Git-Hub

Thanks everyone, we're going to take a look at this.

May 23 '24 15:05 MthwRobinson

The workaround works, just make sure that you do your uninstall after you've done your requirements install

RUN pip install -r requirements.txt
RUN pip uninstall -y opencv-python opencv-contrib-python && pip install opencv-python-headless==4.8.0.76

*edited as I forgot what project I was looking at

May 23 '24 19:05 pjaol
