File uploads which are large (up to 8Gb)

Open intra-aud opened this issue 3 years ago • 3 comments

Hi @eikek ,

I am rather impressed with Docspell and have been playing around with the tool for a little while now. I intend to use Docspell with rather large PDF files (a small eDiscovery task) and seem to have run into a bug with large file uploads that was mentioned previously.

Would you be so kind as to point me to the config attribute that controls the maximum upload size, or is this still in development?

intra-aud, Jul 07 '22 12:07

Use these settings and change them accordingly:

Images greater than this size are skipped. Note that every image is loaded completely into memory for doing OCR. This is the pixel count, height * width of the image.

DOCSPELL_JOEX_CONVERT_MAX__IMAGE__SIZE=14000000
DOCSPELL_JOEX_EXTRACTION_OCR_MAX__IMAGE__SIZE=14000000
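
To get a feel for what a limit of 14000000 means in practice, here is a rough sketch of the comparison described above (pixel count = height * width, and anything greater than the limit is skipped). The function name `is_skipped` and the constant are made up for illustration; this is not Docspell's actual code, only the arithmetic from the comment.

```python
# Hypothetical helper illustrating the documented rule; not taken from Docspell.
DEFAULT_MAX_IMAGE_SIZE = 14_000_000  # the value quoted in the settings above


def is_skipped(width: int, height: int,
               max_image_size: int = DEFAULT_MAX_IMAGE_SIZE) -> bool:
    """Return True if the pixel count (height * width) is greater than the limit."""
    return width * height > max_image_size


# An A4 page scanned at 300 dpi is roughly 2480 x 3508 pixels.
print(is_skipped(2480, 3508))  # 8,699,840 pixels, below the limit -> False

# The same page at 600 dpi is roughly 4960 x 7016 pixels.
print(is_skipped(4960, 7016))  # 34,799,360 pixels, above the limit -> True
```

So if the PDFs contain high-resolution scans, the limit would probably have to go up to a value above the largest page's pixel count, keeping in mind that each image is loaded completely into memory.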

You may also need to increase the timeout(s) by a lot! See: https://docspell.org/docs/configure/defaults/

8GB is a lot (for 1 file). You'd need at least 8GB of RAM for processing a single file. Make sure you have enough resources by also increasing the Java heap size.

Snify89, Jul 07 '22 12:07

In addition to what @Snify89 said: you could disable some processing to save resources if you want. For example, running ocrmypdf is not required when you already have PDFs in the first place. Docspell is not really prepared for files that large, tbh. But otoh, with enough resources, it should work. Never tried it myself though :) Happy to hear how it goes!

eikek, Jul 07 '22 12:07

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. This only applies to 'question' issues. Always feel free to reopen or create new issues. Thank you!

stale[bot], Aug 10 '22 02:08