Lightweight installation of unstructured[pdf]?
Hello, is there a way to install unstructured[pdf] in a lightweight form, just to use the "fast" strategy without all the other dependencies? Thank you in advance for your support.
Hi @liturrig, unstructured does not currently have a "pdf-fast-only" install option.
Can you say a bit more about your use case and why you want something like that?
Why does it install nvidia libs ? When I added ["pdf"] docker image size increased to 6GB from 600MB before. That's insane.
That's probably one of the biggest reasons why they created their own API. Our project's size is really big as well.
As I understand it, the massive increase in dependency size comes from a chain like this: extracting text from images requires unstructured-inference, which requires torch, which pulls in the nvidia libs.
For those of us who do not want to extract text from images in PDFs, it would be very helpful not to have to install these huge dependencies.
https://github.com/Unstructured-IO/unstructured/blob/main/requirements/extra-pdf-image.in
Is this a duplicate of https://github.com/Unstructured-IO/unstructured/issues/3326?
@liturrig - not in a straightforward way, but yes. If you're using "fast" for partition_pdf, you only need ["pdf2image", "pdfminer", "PIL"] (you can explore here).
So the way to reduce the size of the install is:
- Install only the `unstructured` module -> `pip install unstructured`
- Do not install any extras like `unstructured[pdf]`, because this by default automatically pulls in everything from the requirements
- When partitioning, use `from unstructured.partition.auto import partition`, which will automatically recognize .pdf files but, if your strategy is set to "fast", will not require `google-cloud-vision` or `effdet`, which are the main size monsters
Keep in mind that you might need some extra packages from the pdf requirements (linked above) but these are all reasonable in size. effdet alone installs several Nvidia modules that hog up space.
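The steps above can be sketched roughly as follows. This is a minimal sketch, not a tested recipe: it assumes a plain `pip install unstructured` plus the small PDF packages mentioned earlier, and the helper names `partition_fast` and `count_element_types` are made up for illustration. Whether the import succeeds without unstructured-inference may depend on the unstructured version you have.

```python
from collections import Counter


def partition_fast(path):
    """Partition a document using the "fast" strategy.

    The import happens inside the function so that this file can be
    loaded even in an environment where unstructured is not installed.
    """
    from unstructured.partition.auto import partition

    return partition(filename=path, strategy="fast")


def count_element_types(elements):
    """Count elements by class name (hypothetical helper for a quick look)."""
    return Counter(type(el).__name__ for el in elements)


# Example usage (the file name is a placeholder):
# elements = partition_fast("example.pdf")
# print(count_element_types(elements))
```

If the call fails with a missing-module error, the message usually names the package to add separately.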
@scanny - unstructured-inference, which contains effdet (see the @requires_dependencies("unstructured_inference") decorator, link), is only used with the "hi_res" strategy (albeit this is the default one). So anyone using unstructured to partition PDFs with "fast" loads a lot of models they never use (even when running CPU-only).
In my project I have these, which of them should I keep?
If you're using the "fast" strategy you can do without unstructured-inference. Depending on your project, though, this might break some things, since unstructured-inference carries quite a few dependencies (see here), of which layoutparser and timm are the ones that bring in a lot of models that are unnecessary for the "fast" strategy.
Looking at this, I'm not sure how you ended up with the inference package in the first place, @NathanAP ... the docx extras don't bring it in, unless you added unstructured[pdf] at some point. My suggestion is to just pip install unstructured (or add it to your .toml) and then add any additional dependencies (like python-docx) separately to your project (best for slimming down the image size).
Also important: it does matter which version of unstructured you use. Older versions (<0.12, I think) would break without inference. The one I tested on is 0.14.0.
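To answer "which of them should I keep?" for a given environment, it can help to first see which of the heavy packages named in this thread are actually present. A minimal sketch using only the standard library; the module list is taken from the discussion and is not exhaustive, and `installed_modules` is a hypothetical helper name.

```python
import importlib.util

# Import names of the large packages mentioned in this thread.
HEAVY_MODULES = ["torch", "unstructured_inference", "effdet", "timm", "layoutparser"]


def installed_modules(names):
    """Return the subset of names that can be located on the import path."""
    found = []
    for name in names:
        try:
            # find_spec locates a top-level module without importing it.
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is not None:
            found.append(name)
    return found


print(installed_modules(HEAVY_MODULES))
```

An empty list suggests none of these were pulled in, which is what you would hope for after a plain `pip install unstructured`.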
You should not need the libs that come with torch for GPU support if you don't have a GPU. Try installing torch before the unstructured libraries, like this:
`-f https://download.pytorch.org/whl/torch_stable.html torch==2.3.0+cpu`
This should go above the unstructured libs in requirements.txt. It saved me around 2 GB of image size when used with Lambdas.
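After installing, one way to check which torch build ended up in the environment is to read the local version tag (the part after `+`, such as `cpu` or `cu116`) from the installed package metadata. A minimal sketch; `local_build_tag` and `installed_torch_build` are hypothetical helper names. Wheels installed from PyPI may carry no tag in their metadata at all, so a missing tag does not prove the build is CPU-only.

```python
from importlib.metadata import PackageNotFoundError, version


def local_build_tag(version_string):
    """Return the part after '+' in a version string, or None if absent."""
    _, sep, tag = version_string.partition("+")
    return tag if sep else None


def installed_torch_build():
    """Return (version, build_tag) for the installed torch, or None."""
    try:
        v = version("torch")
    except PackageNotFoundError:
        return None
    return v, local_build_tag(v)


print(installed_torch_build())
```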
I had PyTorch 2.4 installed but pip install unstructured[pdf] tries to install torch==2.0.1. I can't work out why. The dependencies don't require this, yet this is what pip outputs during install.
Collecting torch (from unstructured-inference==0.7.36->unstructured[pdf])
Obtaining dependency information for torch from https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl.metadata
Downloading torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl.metadata (24 kB)
Then it uninstalls the newer version:
Installing collected packages: triton, torch
Attempting uninstall: triton
Found existing installation: triton 3.0.0
Uninstalling triton-3.0.0:
Removing file or directory /home/davidg/.virtualenvs/learning/bin/proton
Removing file or directory /home/davidg/.virtualenvs/learning/bin/proton-viewer
Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/triton-3.0.0.dist-info/
Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/triton/
Successfully uninstalled triton-3.0.0
Attempting uninstall: torch
Found existing installation: torch 2.4.1
Uninstalling torch-2.4.1:
Removing file or directory /home/davidg/.virtualenvs/learning/bin/convert-caffe2-to-onnx
Removing file or directory /home/davidg/.virtualenvs/learning/bin/convert-onnx-to-caffe2
Removing file or directory /home/davidg/.virtualenvs/learning/bin/torchrun
Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/functorch/
Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/torch-2.4.1.dist-info/
Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/torch/
Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/torchgen/
Successfully uninstalled torch-2.4.1
Which breaks a bunch of other packages I have installed:
lightning 2.4.0 requires torch<4.0,>=2.1.0, but you have torch 2.0.1 which is incompatible.
lightning-flash 0.8.2 requires pytorch-lightning<2.0.0,>1.8.0, but you have pytorch-lightning 2.0.7 which is incompatible.
torchaudio 0.13.1+cu116 requires torch==1.13.1, but you have torch 2.0.1 which is incompatible.
So the installation fails.
I really just want to try out the package. Is the best option the docker image, or just test things with the serverless API?
Lemme know if this should be a new issue.
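When pip picks an unexpected torch version like this, one thing to look at is what the already-installed packages declare about torch. A minimal sketch with the standard library; it only inspects distributions that are installed in the current environment and does not explain pip's resolver decisions. `requirements_mentioning` is a hypothetical helper name, and the distribution names are the ones from the log above.

```python
from importlib.metadata import PackageNotFoundError, requires


def requirements_mentioning(dist_name, needle):
    """List the declared requirements of dist_name that contain needle."""
    try:
        # requires() returns None when a distribution declares no requirements.
        declared = requires(dist_name) or []
    except PackageNotFoundError:
        return []
    return [req for req in declared if needle.lower() in req.lower()]


for dist in ("unstructured-inference", "lightning", "torchaudio"):
    print(dist, requirements_mentioning(dist, "torch"))
```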