bug/macos-arm64-and-lxml-import-error

Open liamvdv opened this issue 1 year ago • 4 comments

Describe the bug

Cannot use unstructured on macOS (M2 Pro) because from unstructured.partition.html import partition_html throws:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/partition/html.py", line 7, in <module>
    from unstructured.documents.html import HTMLDocument
  File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/documents/html.py", line 11, in <module>
    from lxml import etree
ImportError: dlopen(/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace '___cyg_profile_func_enter'

To Reproduce

brew install python@3.9
python3.9 -m venv .venv
source .venv/bin/activate
python --version # should show 3.9 now
which python # should be .../.venv/bin/....
pip install unstructured
python
# in interactive shell
from unstructured.partition.html import partition_html

On my machine, this throws:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/partition/html.py", line 7, in <module>
    from unstructured.documents.html import HTMLDocument
  File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/documents/html.py", line 11, in <module>
    from lxml import etree
ImportError: dlopen(/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace '___cyg_profile_func_enter'

Expected behavior

The import succeeds, so that HTML/XML files can then be parsed.

Environment Info

/Users/liamvdv/src/github.com/REDACT/collect.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
OS version:  macOS-13.4.1-arm64-arm-64bit
Python version:  3.9.18
unstructured version:  0.10.19
unstructured-inference is not installed
pytesseract is not installed
Torch is not installed
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
Traceback (most recent call last):
  File "/Users/liamvdv/src/github.com/REDACT/collect.py", line 242, in <module>
    main()
  File "/Users/liamvdv/src/github.com/REDACT/collect.py", line 234, in main
    libreoffice_version = get_libreoffice_version()
  File "/Users/liamvdv/src/github.com/REDACT/collect.py", line 163, in get_libreoffice_version
    result = subprocess.run(
  File "/opt/homebrew/Cellar/python@3.9/3.9.18/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/opt/homebrew/Cellar/python@3.9/3.9.18/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/homebrew/Cellar/python@3.9/3.9.18/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'

Additional context

I'm on a Mac M2 Pro with macOS version 13.4.1.

Thank you for your help.

liamvdv • Oct 11 '23 09:10

Hi @liamvdv, sorry for the late response; we are tracking this and reviewing the problem. We will keep this thread updated.

badGarnet • Oct 20 '23 20:10

This thread hasn't been updated. Is it fixed?

MikeRecognex • Apr 14 '24 08:04

I'm having the same issue too.

zihaolam • Apr 16 '24 13:04

After getting an Apple Silicon Mac I was finally able to reproduce this error.

I believe the problem is that arm64 wheels are not available for the latest versions of lxml. The solution that worked for me is the following:

$ pip install lxml==4.9.2

The later versions of lxml have "universal" macOS wheels and for some reason those don't seem to work.

scanny • Apr 28 '24 01:04

Closing this issue. If you run into this problem, try @scanny's suggestion above.

MthwRobinson • May 16 '24 14:05
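
For anyone checking whether a pinned lxml actually loads on their machine, here is a minimal diagnostic sketch. It is not from the thread: check_import and describe_interpreter are hypothetical helper names, and the script uses only the Python standard library. It reports the interpreter's architecture and whether lxml.etree imports, relying on the fact that the dlopen failure in the traceback above surfaces as an ImportError.

```python
import importlib
import platform


def check_import(module_name):
    """Try to import a module and return (ok, detail) instead of raising.

    Hypothetical helper for illustration. A missing package and a shared
    library that fails to load (the dlopen error in this issue) both
    surface as ImportError, so both are reported the same way here.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        return False, f"{type(exc).__name__}: {exc}"
    # Not every module defines __version__, so fall back to a placeholder.
    return True, str(getattr(module, "__version__", "unknown version"))


def describe_interpreter():
    """Collect the interpreter details that matter when a wheel fails to load."""
    return {
        "machine": platform.machine(),  # e.g. "arm64" on Apple Silicon
        "python": platform.python_version(),
        "platform": platform.platform(),
    }


if __name__ == "__main__":
    print(describe_interpreter())
    for name in ("lxml", "lxml.etree"):
        ok, detail = check_import(name)
        print(name, "OK" if ok else "FAILED", detail)
```

Running it inside the affected virtual environment before and after pip install lxml==4.9.2 should show whether the pin changes the outcome; it only observes the import and does not explain why a given wheel fails.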