Lightweight installation unstructured[pdf] ?????

Open liturrig opened this issue 1 year ago • 9 comments

Hello, is there a way to install unstructured[pdf] in a lightweight form, just to use the "fast" strategy, without all the other dependencies? Thank you in advance for your support.

liturrig avatar May 07 '24 09:05 liturrig

Hi @liturrig, unstructured does not currently have a "pdf-fast-only" install option.

Can you say a bit more about your use case and why you want something like that?

scanny avatar May 07 '24 16:05 scanny

Why does it install nvidia libs ? When I added ["pdf"] docker image size increased to 6GB from 600MB before. That's insane.

mszpulak avatar May 08 '24 13:05 mszpulak

> Why does it install nvidia libs ? When I added ["pdf"] docker image size increased to 6GB from 600MB before. That's insane.

That's probably one of the biggest reasons why they created their own API. Our project's size is really big as well.

NathanAP avatar May 08 '24 19:05 NathanAP

As I understand it, the massive increase in dependency size comes from a chain like this: extracting text from images requires unstructured-inference, which requires torch, which pulls in the NVIDIA libraries.

For those of us who do not want to extract text from images in PDFs, it would be very helpful not to have to install these huge dependencies.

https://github.com/Unstructured-IO/unstructured/blob/main/requirements/extra-pdf-image.in

Is this a duplicate of https://github.com/Unstructured-IO/unstructured/issues/3326?

gecBurton avatar Jul 17 '24 11:07 gecBurton

@liturrig - not in a straightforward way, but yes. If you're using "fast" for partition_pdf, you only need ["pdf2image", "pdfminer", "PIL"] (you can explore here).

So the way to improve the size of the module is:

  • Install only the unstructured module -> pip install unstructured
  • Do not install any extras like unstructured[pdf], because by default this automatically pulls in everything from the requirements
  • When partitioning, use from unstructured.partition.auto import partition, which automatically recognizes .pdf files but, if your strategy is set to "fast", does not require google-cloud-vision or effdet, which are the main size monsters

Keep in mind that you might need some extra packages from the pdf requirements (linked above) but these are all reasonable in size. effdet alone installs several Nvidia modules that hog up space.
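
A minimal sketch of the steps above, assuming unstructured was installed without extras and that the small PDF packages it asks for (for example pdfminer.six) were added by hand. The helper name partition_pdf_fast and the file name example.pdf are placeholders for illustration, not part of the library.

```python
from typing import List


def partition_pdf_fast(path: str) -> List[str]:
    """Partition a PDF with the "fast" strategy and return each element's text."""
    # Imported lazily so this module can be loaded even if unstructured is missing.
    from unstructured.partition.auto import partition

    # Assumption from this thread: with strategy="fast" the heavy
    # unstructured-inference / torch stack is not needed.
    elements = partition(filename=path, strategy="fast")
    return [str(element) for element in elements]


# Usage (placeholder file name):
#     for text in partition_pdf_fast("example.pdf"):
#         print(text)
```

If a required package is missing, the call should fail with an error naming it, which is a reasonable way to find out which small dependencies still need to be installed.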

@scanny - unstructured-inference, which contains effdet and is guarded by @requires_dependencies("unstructured_inference") (link), is only used with the "hi_res" strategy (albeit this is the default one). So anyone using unstructured to fast-partition PDFs loads a lot of models they never use (even in the case where you are CPU-only).

belmmostest avatar Jul 25 '24 10:07 belmmostest

image

In my project I have these, which of them should I keep?

NathanAP avatar Jul 25 '24 13:07 NathanAP

If you're using the "fast" strategy you can do without unstructured-inference. Depending on your project, though, removing it might break some things, since unstructured-inference carries quite a few dependencies (see here), of which layoutparser and timm are the ones that bring in a lot of models that are unnecessary for the "fast" strategy.
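
To see which of the large packages mentioned in this thread are actually present in an environment, something like the following can help. It is only a sketch: the list of module names is illustrative and the function name installed_modules is made up for this example.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names of the large packages mentioned in this thread (illustrative list).
HEAVY_MODULES = ("torch", "unstructured_inference", "effdet", "timm", "layoutparser")


def installed_modules(names: Iterable[str] = HEAVY_MODULES) -> List[str]:
    """Return the subset of `names` that can be imported in this environment."""
    return [name for name in names if find_spec(name) is not None]


print("Heavy packages present:", installed_modules() or "none")
```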

Looking at this I'm not sure how you ended up with the inference package in the first place, @NathanAP ... docx extras don't bring them in. Unless you added unstructured[pdf] at some point. My suggestion is, just pip install unstructured (or add to .toml) and then add any additional dependencies (like python-docx) separately to your project (best for slimming down the image size).

Also important: it does matter which version of unstructured you use. Older versions (<0.12, I think) would break without unstructured-inference. The one I tested on is 0.14.0.
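
A quick way to check the installed version against that rough threshold. This is a sketch only: the 0.12 cut-off is the guess from the comment above, not a documented boundary, and the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def release_tuple(version_string: str) -> Tuple[int, ...]:
    """Turn '0.14.0' into (0, 14, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def unstructured_version() -> Optional[str]:
    """Return the installed unstructured version, or None if it is not installed."""
    try:
        return version("unstructured")
    except PackageNotFoundError:
        return None


installed = unstructured_version()
if installed is None:
    print("unstructured is not installed")
elif release_tuple(installed) < (0, 12):  # threshold is a guess from this thread
    print(f"unstructured {installed} may break without unstructured-inference")
else:
    print(f"unstructured {installed} is in the range reported to work with 'fast'")
```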

belmmostest avatar Jul 25 '24 14:07 belmmostest

You shouldn't need the GPU-support libraries that come with torch if you don't have a GPU. Try installing the CPU-only build of torch before the unstructured libraries, like this:

```
-f https://download.pytorch.org/whl/torch_stable.html
torch==2.3.0+cpu
```

This should go above the unstructured libs in requirements.txt. It saved me around 2 GB of image size when used with Lambdas.

sidatcd avatar Jul 25 '24 14:07 sidatcd

I had PyTorch 2.4 installed but pip install unstructured[pdf] tries to install torch==2.0.1. I can't work out why. The dependencies don't require this, yet this is what pip outputs during install.

Collecting torch (from unstructured-inference==0.7.36->unstructured[pdf])
  Obtaining dependency information for torch from https://files.pythonhosted.org/packages/8c/4d/17e07377c9c3d1a0c4eb3fde1c7c16b5a0ce6133ddbabc08ceef6b7f2645/torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl.metadata
  Downloading torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl.metadata (24 kB)

Then it uninstalls the newer version:

Installing collected packages: triton, torch
  Attempting uninstall: triton
    Found existing installation: triton 3.0.0
    Uninstalling triton-3.0.0:
      Removing file or directory /home/davidg/.virtualenvs/learning/bin/proton
      Removing file or directory /home/davidg/.virtualenvs/learning/bin/proton-viewer
      Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/triton-3.0.0.dist-info/
      Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/triton/
      Successfully uninstalled triton-3.0.0
  Attempting uninstall: torch
    Found existing installation: torch 2.4.1
    Uninstalling torch-2.4.1:
      Removing file or directory /home/davidg/.virtualenvs/learning/bin/convert-caffe2-to-onnx
      Removing file or directory /home/davidg/.virtualenvs/learning/bin/convert-onnx-to-caffe2
      Removing file or directory /home/davidg/.virtualenvs/learning/bin/torchrun
      Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/functorch/
      Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/torch-2.4.1.dist-info/
      Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/torch/
      Removing file or directory /home/davidg/.virtualenvs/learning/lib/python3.10/site-packages/torchgen/
      Successfully uninstalled torch-2.4.1

Which breaks a bunch of other packages I have installed:

lightning 2.4.0 requires torch<4.0,>=2.1.0, but you have torch 2.0.1 which is incompatible.
lightning-flash 0.8.2 requires pytorch-lightning<2.0.0,>1.8.0, but you have pytorch-lightning 2.0.7 which is incompatible.
torchaudio 0.13.1+cu116 requires torch==1.13.1, but you have torch 2.0.1 which is incompatible.

So the installation fails.
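
One way to inspect which installed packages declare a requirement on torch, and with what constraint, is to read the metadata of the installed distributions. This is a rough sketch using only the standard library; the helper names are made up for illustration, and it only reports declared requirements, not how pip's resolver ends up choosing a version.

```python
import re
from importlib.metadata import distributions
from typing import Dict, List


def requirement_name(requirement: str) -> str:
    """Extract the normalized project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    name = match.group(1) if match else ""
    return re.sub(r"[-_.]+", "-", name).lower()


def who_requires(target: str) -> Dict[str, List[str]]:
    """Map each installed distribution to its declared requirements on `target`."""
    wanted = requirement_name(target)
    found: Dict[str, List[str]] = {}
    for dist in distributions():
        hits = [req for req in (dist.requires or []) if requirement_name(req) == wanted]
        if hits:
            found[dist.metadata.get("Name") or "unknown"] = hits
    return found


for package, requirements in sorted(who_requires("torch").items()):
    print(package, "->", requirements)
```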

I really just want to try out the package. Is the best option the docker image, or just test things with the serverless API?

Lemme know if this should be a new issue.

davidgilbertson avatar Sep 10 '24 01:09 davidgilbertson