
bug/unstructured-ingest-hanging

Open • gndctl-mehul opened this issue 4 months ago • 1 comment

Describe the bug The unstructured-ingest CLI command hangs when run. I suspect it treats the root page as a Page Block and tries to parse it, at which point it hangs.

To Reproduce We are investigating this and will update with details.

The best we can offer for now: we're running in recursive mode and providing a page ID that is the child of a database, where the database's parent is the workspace.

Expected behavior The ingest should throw an error, or run to completion.
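
Until the root cause is known, one way to get the "throw an error" behavior is to run the command under a hard timeout, so a hang turns into an exception. This is a minimal sketch, not a fix in the ingest code: `run_with_timeout` is a hypothetical helper, and the sleeping command below is a placeholder standing in for the real ingest invocation.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code; raise TimeoutError if it runs longer than timeout_s."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child process before re-raising TimeoutExpired.
        raise TimeoutError(f"command still running after {timeout_s}s: {cmd}") from exc
    return completed.returncode


if __name__ == "__main__":
    # Placeholder command that just sleeps; substitute the actual ingest command here.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    try:
        run_with_timeout(hanging, timeout_s=1)
    except TimeoutError as err:
        print(f"ingest did not finish: {err}")
```

A suitable timeout depends on the size of the workspace being ingested, so treat the value as something to tune.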

Screenshots No screenshots or logs available.

Environment Info We're using the Docker Container:

docker image ls | grep unstructured
downloads.unstructured.io/unstructured-io/unstructured   latest    104a18d9e603   3 days ago     8.17GB

I couldn't find the script in the container, so I copied it in and ran it. It hit a few dependency errors, but otherwise it looks like it collected the info you need:

python3 collect.py
/home/notebook-user/collect.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
OS version:  Linux-6.4.16-linuxkit-aarch64-with-glibc2.34
Python version:  3.10.13
unstructured version:  0.12.5
unstructured-inference version:  0.7.23
pytesseract version:  0.3.10
Torch version:  2.2.0
Detectron2 is not installed

[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: pip install --upgrade pip

[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
PaddleOCR is not installed
Traceback (most recent call last):
  File "/home/notebook-user/collect.py", line 242, in <module>
    main()
  File "/home/notebook-user/collect.py", line 224, in main
    libmagic_version = get_libmagic_version()
  File "/home/notebook-user/collect.py", line 146, in get_libmagic_version
    result = subprocess.run(
  File "/usr/local/lib/python3.10/subprocess.py", line 503, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/local/lib/python3.10/subprocess.py", line 971, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/lib/python3.10/subprocess.py", line 1863, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'file'
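
The traceback above ends in `FileNotFoundError` for `'file'`, which suggests the `file` binary is simply not installed in the container. A sketch of how a version check could tolerate that is below. I have not read `collect.py`, so this is an assumption about what it does: `get_tool_version` is a hypothetical name, and running `file --version` is a guess at the underlying command.

```python
import shutil
import subprocess


def get_tool_version(tool="file"):
    """Return the first line of `<tool> --version`, or None if the tool is not on PATH."""
    # shutil.which returns None when the executable cannot be found,
    # which avoids the FileNotFoundError raised by subprocess.run.
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print("Libmagic version: ", get_tool_version() or "not installed")
```

With a guard like this the script would report the tool as missing, in the same style as the Detectron2 and PaddleOCR lines, and continue.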


gndctl-mehul • Feb 28 '24 21:02