
facing problem in partition_pdf

Open Rittik003 opened this issue 11 months ago • 2 comments

After running this import:

from unstructured.partition.pdf import partition_pdf

I get the following error:

Cell In[7], line 1
----> 1 from unstructured.partition.pdf import partition_pdf

File c:\Users\ASUS\anaconda3\Lib\site-packages\unstructured\partition\pdf.py:17
     15 from pdfminer.layout import LTContainer, LTImage, LTItem, LTTextBox
     16 from pdfminer.utils import open_filename
---> 17 from pi_heif import register_heif_opener
     18 from PIL import Image as PILImage
     19 from pypdf import PdfReader

ModuleNotFoundError: No module named 'pi_heif'

I then ran:

!pip install "unstructured[all-docs]"

Now I get a different error:

ImportError: DLL load failed while importing onnx_cpp2py_export: A dynamic link library (DLL) initialization routine failed.
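Both errors above are failures to import optional dependencies. As a rough first check, the sketch below reports which of the modules named in the tracebacks cannot be located in the current environment. The helper name missing_modules is made up for illustration. Note that a module being found does not guarantee it loads: the onnx error above is a DLL initialization failure, which this check would not catch.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names taken from the tracebacks in this thread.
    print(missing_modules(["pi_heif", "onnx", "pdfminer", "PIL", "pypdf"]))
```

Run it with the same interpreter (or in the same notebook kernel) that raises the error, since packages installed into a different environment will not be visible.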

Rittik003, Jan 17 '25 15:01

This library has a lot of dependencies, but there is no clear documentation for them. I am currently getting the error below.

I need to run some extraction before passing the result to an LLM. Please let me know how to solve this.

    580 env["LD_LIBRARY_PATH"] = poppler_path + ":" + env.get("LD_LIBRARY_PATH", "")
--> 581 proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
    583 try:

File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\subprocess.py:971, in Popen.__init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags, restore_signals, start_new_session, pass_fds, user, group, extra_groups, encoding, errors, text, umask, pipesize)
    968 self.stderr = io.TextIOWrapper(self.stderr,
    969         encoding=encoding, errors=errors)
--> 971 self._execute_child(args, executable, preexec_fn, close_fds,
    972                     pass_fds, cwd, env,
    973                     startupinfo, creationflags, shell,
    974                     p2cread, p2cwrite,
    975                     c2pread, c2pwrite,
    976                     errread, errwrite,
    977                     restore_signals,
    978                     gid, gids, uid, umask,
    979                     start_new_session)
    980 except:
    981     # Cleanup if the child failed starting.

File C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.3056.0_x64__qbz5n2kfra8p0\lib\subprocess.py:1456, in Popen._execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, unused_restore_signals, unused_gid, unused_gids, unused_uid, unused_umask, unused_start_new_session)
   1455 try:
-> 1456     hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
...
    611 raise PDFPageCountError(
    612     f"Unable to get page count.\n{err.decode('utf8', 'ignore')}"
    613 )

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
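The traceback shows the failure happening while a child process is being launched, which is consistent with the poppler command-line tools not being found. A quick way to confirm this, sketched below, is to ask Python whether the executables resolve on PATH. The function name find_poppler_tools is hypothetical, and the tool names pdfinfo and pdftoppm are the poppler utilities that pdf2image normally shells out to.

```python
import shutil


def find_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Map each poppler executable name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If either tool is reported as not found, poppler is either not installed or its bin/ folder is not on PATH for this Python process.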


AloyBanerjee, Jan 18 '25 03:01

https://github.com/Belval/pdf2image

Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612 version which is the most up-to-date. You will then have to add the bin/ folder to PATH or use poppler_path = r"C:\path\to\poppler-xx\bin" as an argument in convert_from_path.
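A minimal sketch of both options mentioned above, assuming poppler has already been downloaded and unpacked. The folder in POPPLER_BIN is the placeholder path from the quote and must be replaced with your own; the helper names prepend_to_path, partition_with_poppler and pages_with_poppler are made up for illustration. Changing PATH inside the process is one way to make poppler visible to partition_pdf without editing the system settings.

```python
import os

# Placeholder from the pdf2image README: replace with the bin/ folder of your own poppler download.
POPPLER_BIN = r"C:\path\to\poppler-xx\bin"


def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH (once) so child processes can find poppler."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


def partition_with_poppler(pdf_path, poppler_bin=POPPLER_BIN):
    """Option 1: add poppler's bin/ folder to PATH, then call partition_pdf."""
    prepend_to_path(poppler_bin)
    # Imported here so the example can be read and tested without unstructured installed.
    from unstructured.partition.pdf import partition_pdf
    return partition_pdf(filename=pdf_path)


def pages_with_poppler(pdf_path, poppler_bin=POPPLER_BIN):
    """Option 2: pass poppler_path directly when calling pdf2image yourself."""
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, poppler_path=poppler_bin)
```

The PATH change only lasts for the current Python process, so in a notebook it has to run in the same kernel before the PDF is partitioned.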

taylorn-ai, Feb 10 '25 05:02