unstructured
bug/macos-arm64-and-lxml-import-error
Describe the bug
I cannot use unstructured on a Mac with an M2 Pro, because running from unstructured.partition.html import partition_html throws:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/partition/html.py", line 7, in <module>
from unstructured.documents.html import HTMLDocument
File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/documents/html.py", line 11, in <module>
from lxml import etree
ImportError: dlopen(/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace '___cyg_profile_func_enter'
To Reproduce
brew install python@3.9
python3.9 -m venv .venv
source .venv/bin/activate
python --version # should show 3.9 now
which python # should be .../.venv/bin/....
pip install unstructured
python
# in interactive shell
from unstructured.partition.html import partition_html
on my machine throws
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/partition/html.py", line 7, in <module>
from unstructured.documents.html import HTMLDocument
File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/documents/html.py", line 11, in <module>
from lxml import etree
ImportError: dlopen(/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/lxml/etree.cpython-39-darwin.so, 0x0002): symbol not found in flat namespace '___cyg_profile_func_enter'
Expected behavior
The import succeeds, so that I can then parse HTML/XML files.
Screenshots
Environment Info
/Users/liamvdv/src/github.com/REDACT/collect.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
import pkg_resources
OS version: macOS-13.4.1-arm64-arm-64bit
Python version: 3.9.18
unstructured version: 0.10.19
unstructured-inference is not installed
pytesseract is not installed
Torch is not installed
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
Traceback (most recent call last):
File "/Users/liamvdv/src/github.com/REDACT/collect.py", line 242, in <module>
main()
File "/Users/liamvdv/src/github.com/REDACT/collect.py", line 234, in main
libreoffice_version = get_libreoffice_version()
File "/Users/liamvdv/src/github.com/REDACT/collect.py", line 163, in get_libreoffice_version
result = subprocess.run(
File "/opt/homebrew/Cellar/python@3.9/3.9.18/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 505, in run
with Popen(*popenargs, **kwargs) as process:
File "/opt/homebrew/Cellar/python@3.9/3.9.18/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 951, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/opt/homebrew/Cellar/python@3.9/3.9.18/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 1837, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'libreoffice'
Additional context
I'm on a Mac M2 Pro with macOS version 13.4.1.
Thank you for your help.
Hi @liamvdv, sorry for the late response; we are tracking this and reviewing the problem. We will keep this thread updated.
This thread hasn't been updated. Is it fixed?
having same issue too
After getting an Apple Silicon Mac I was finally able to reproduce this error.
I believe the problem is that arm64 wheels are not available for the latest versions of lxml. The solution that worked for me is the following:
$ pip install lxml==4.9.2
The later versions of lxml have "universal" macOS wheels, and for some reason those don't seem to work.
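To check whether a given environment is affected, a small script along these lines may help. It is only a minimal sketch: diagnose_lxml is a hypothetical helper name (not part of unstructured or lxml), and it simply reports the CPU architecture, the installed lxml version, and whether lxml.etree can actually be imported.

```python
import importlib
import platform
from importlib import metadata


def diagnose_lxml():
    """Collect the facts relevant to this issue into a plain dict.

    Hypothetical helper for illustration only.
    """
    report = {
        "machine": platform.machine(),  # 'arm64' for a native Python on Apple Silicon
        "python": platform.python_version(),
        "lxml_version": None,
        "etree_import_ok": False,
        "error": None,
    }
    try:
        report["lxml_version"] = metadata.version("lxml")
    except metadata.PackageNotFoundError:
        report["error"] = "lxml is not installed"
        return report
    try:
        importlib.import_module("lxml.etree")
        report["etree_import_ok"] = True
    except ImportError as exc:
        # The dlopen failure shown in the traceback is raised as ImportError.
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in diagnose_lxml().items():
        print(f"{key}: {value}")
```

If etree_import_ok is False on an arm64 machine and the error mentions a missing symbol, pinning lxml as shown above is worth trying; running the script again afterwards shows whether the import now succeeds.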
Closing this issue. You can try @scanny's suggestion above if you run into this problem.