feat/Allow PDF partitioning without unstructured_inference
Is your feature request related to a problem? Please describe.
Up until unstructured 0.10.27, it was possible to use the fast and ocr_only strategies without having unstructured_inference installed (which pulls in a lot of transitive dependencies). However, starting from 0.10.28, there is a hard dependency on unstructured_inference for PDF partitioning in two ways:
- A top-level import of unstructured.partition.ocr, which in turn has a top-level import from unstructured_inference: https://github.com/Unstructured-IO/unstructured/blob/2931cb38e8a5159e9c790a314b848c5c3ff58bb4/unstructured/partition/pdf.py#L76 This makes it impossible to use PDF partitioning without having unstructured_inference installed, as importing from unstructured.partition.pdf will fail.
- For OCR partitioning, there is another explicit check in place to require unstructured_inference: https://github.com/Unstructured-IO/unstructured/blob/2931cb38e8a5159e9c790a314b848c5c3ff58bb4/unstructured/partition/pdf.py#L324
Describe the solution you'd like
Ideally, both fast and ocr_only partitioning would be possible without having to install all of unstructured_inference including its transitive dependencies, basically the state of 0.10.27. This can be done by guarding all imports with explicit checks in various places.
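To illustrate what "guarding imports with explicit checks" could look like, here is a minimal sketch. The helper names (dependency_available, require_dependency, partition_pdf_sketch) are hypothetical and not part of unstructured; the sketch only assumes that hi_res is the strategy that genuinely needs unstructured_inference.

```python
import importlib
import importlib.util


def dependency_available(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_dependency(module_name: str, feature: str):
    """Import and return the module, or raise an error that names the feature."""
    if not dependency_available(module_name):
        raise ImportError(
            f"{feature} requires the optional package '{module_name}', "
            "which is not installed."
        )
    return importlib.import_module(module_name)


def partition_pdf_sketch(strategy: str) -> str:
    """Hypothetical stand-in for a PDF partitioning entry point."""
    if strategy == "hi_res":
        # Assumption: only this strategy needs the heavy optional package.
        require_dependency("unstructured_inference", "hi_res partitioning")
    # fast and ocr_only never touch the optional package in this sketch.
    return strategy
```

With a guard like this, a missing optional package only produces an error when the strategy that needs it is actually requested, rather than when the module is imported.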
Describe alternatives you've considered
- Installing unstructured_inference. In my environment, the application using unstructured is packaged in a Docker image. Adding the unstructured_inference dependency increases the size of the Docker image by more than 3 GB, which makes distribution difficult.
- Restoring fast partitioning by avoiding top-level imports from unstructured.partition.ocr in unstructured.partition.pdf for the code path of the fast strategy. While this restores basic functionality, it reduces the number of parseable PDFs considerably.
Additional context
Happy to provide a PR if you agree with this being a useful feature.
I don't know what version fixed this, but I think it's fixed as of 0.12.5: I can partition PDFs again without having unstructured_inference installed.
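For anyone wanting to confirm their own setup, a small check along these lines reports which of the two packages are installed. It assumes the distribution names on PyPI are unstructured and unstructured-inference; installed_version is a hypothetical helper.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed distribution names; adjust if your environment uses different ones.
    for name in ("unstructured", "unstructured-inference"):
        print(name, installed_version(name) or "not installed")
```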
Yeah, we refactored the imports and dependencies a bit, and I don't think this is still an issue. We can reopen the issue if it pops up again.