feat/Allow PDF partitioning without unstructured_inference

Open flash1293 opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please describe.

Up until unstructured 0.10.27 it was possible to use the fast and ocr_only strategies without having unstructured_inference installed (which pulls in a lot of transitive dependencies). However, starting from 0.10.28 there is a hard dependency on unstructured_inference for PDF partitioning in two ways:

A top-level import of unstructured.partition.ocr, which in turn has a top-level import from unstructured_inference: https://github.com/Unstructured-IO/unstructured/blob/2931cb38e8a5159e9c790a314b848c5c3ff58bb4/unstructured/partition/pdf.py#L76

This makes it impossible to use PDF partitioning without unstructured_inference installed, because importing from unstructured.partition.pdf fails.

For OCR partitioning, there is another explicit check in place to require unstructured_inference: https://github.com/Unstructured-IO/unstructured/blob/2931cb38e8a5159e9c790a314b848c5c3ff58bb4/unstructured/partition/pdf.py#L324

Describe the solution you'd like

Ideally, both fast and ocr_only partitioning are possible without having to install all of unstructured_inference including transitive dependencies, basically the state of 0.10.27. This can be done by guarding all imports with explicit checks in various places.
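To illustrate what "guarding imports with explicit checks" could look like, here is a minimal sketch. The helper names (`is_installed`, `check_strategy`) and the `REQUIREMENTS` mapping are hypothetical and not taken from the unstructured codebase; the mapping only encodes the state I'm asking for, where just hi_res needs unstructured_inference.

```python
import importlib.util
from typing import Dict, List


def is_installed(module_name: str) -> bool:
    """Report whether a top-level module can be found, without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def check_strategy(strategy: str, requirements: Dict[str, List[str]]) -> None:
    """Raise ImportError only if the chosen strategy needs a missing module."""
    missing = [m for m in requirements.get(strategy, []) if not is_installed(m)]
    if missing:
        raise ImportError(
            f"strategy '{strategy}' requires: {', '.join(missing)}"
        )


# Hypothetical mapping (an assumption, not the library's actual table):
# only hi_res depends on the heavy optional package.
REQUIREMENTS = {"fast": [], "ocr_only": [], "hi_res": ["unstructured_inference"]}
```

With a check like this, calling `check_strategy("fast", REQUIREMENTS)` never fails because of the optional package, and the error is raised only when a strategy that really needs it is selected.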

Describe alternatives you've considered

  • Installing unstructured_inference. In my environment, the application using unstructured is packaged in a Docker image; adding the unstructured_inference dependency increases the size of the image by more than 3 GB, which makes distribution difficult.
  • Restoring fast partitioning by avoiding top-level imports from unstructured.partition.ocr in unstructured.partition.pdf for the code path of the fast strategy. While this restores basic functionality, it reduces the number of parseable PDFs considerably.
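The second alternative amounts to deferring the OCR import until a code path actually needs it. A rough sketch of that pattern, using a hypothetical `partition_sketch` function and a module name passed in as a parameter so it can be tried without unstructured installed (in real code this would be a plain function-local `from ... import ...`):

```python
import importlib


def partition_sketch(strategy: str, ocr_module: str) -> str:
    """Return the strategy used; import the OCR module only when required."""
    if strategy == "fast":
        # The fast path returns before any OCR-related import is attempted.
        return "fast"
    # Deferred import: a missing optional module fails here, at call time,
    # instead of when the enclosing module is first imported.
    importlib.import_module(ocr_module)
    return strategy
```

Here the fast strategy works even when `ocr_module` names a package that is not installed, while the other strategies raise `ModuleNotFoundError` only when they are actually invoked.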

Additional context

Happy to provide a PR if you agree with this being a useful feature.

flash1293 avatar Nov 20 '23 17:11 flash1293

I don't know what version fixed this, but I think it's fixed as of 0.12.5: I can partition PDFs again without having unstructured_inference installed.

artdent avatar Mar 06 '24 17:03 artdent

Yeah, we refactored the imports and dependencies a bit and I don't think this is still an issue. We can reopen the issue if it pops up again.

MthwRobinson avatar May 17 '24 12:05 MthwRobinson