paperless Tesseract OCR slow after upgrade to 4.0.0

Today I upgraded the tesseract used by my paperless install (in a virtualenv on Arch Linux) from 3.05.01-6 to 4.0.0-1. Since then, OCR of a single page takes an extremely long time: ./manage.py test paperless_tesseract.tests.test_date.TestDate.test_get_text_1_png took almost ten minutes (574.793s).

Downgrading back to 3.05.01-6 allowed the test to run in 35.145s. Has anyone else seen such a drastic change in performance, and have any suggestions?

EDIT: I suppose this is the danger of running something like Arch, since it appears the 4.X version of Tesseract was released only 2 weeks ago :smirk:

Nov 16 '18 03:11 jat255

If the machine running Tesseract does not have AVX2, Tesseract 4 will be much slower than 3.x.

This is especially true for non-x64 processors.

Jan 17 '19 09:01 jbarlow83

Thanks for the insight, @jbarlow83!

Based on https://github.com/jbarlow83/OCRmyPDF/issues/217, I tried upgrading to v4 of Tesseract while using the OMP_THREAD_LIMIT environment variable to effectively disable OpenMP.

Results are based off of running ./manage.py test paperless_tesseract.tests.test_date.TestDate.test_get_text_1_png 3 times and taking an average (using tesseract as packaged in the Arch repo):

`OMP_THREAD_LIMIT` value	Tesseract version	Test runtime (mean ± std. dev.)
`unset OMP_THREAD_LIMIT`	3.05.01-6	(38.6 ± 2.4) s
`export OMP_THREAD_LIMIT=1`	3.05.01-6	(39.9 ± 2.4) s
`unset OMP_THREAD_LIMIT`	4.0.0-1	613.8 s (only run once)
`export OMP_THREAD_LIMIT=1`	4.0.0-1	(75.1 ± 3.6) s

So the updated version of Tesseract is slower (almost 2x so), but not unreasonable in my opinion. For an installation such as mine (running in a VM on a relatively under-powered dual core NAS), it would probably be wise to set this environment variable by default.
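One way to make that the default without overriding a value the user has already chosen is sketched below. This is only an illustration: `apply_default_thread_limit` is a hypothetical helper, not something that exists in Paperless, and it assumes it is called before any Tesseract process is started.

```python
def apply_default_thread_limit(environ, default="1"):
    """Set OMP_THREAD_LIMIT only if it is not already set.

    `environ` is any mutable mapping; in real use it would be os.environ,
    which child processes such as tesseract inherit.
    Hypothetical helper for illustration; not part of Paperless.
    """
    environ.setdefault("OMP_THREAD_LIMIT", default)
    return environ["OMP_THREAD_LIMIT"]
```

Calling `apply_default_thread_limit(os.environ)` early in startup would leave an explicit `export OMP_THREAD_LIMIT=...` untouched and fall back to 1 otherwise.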

Jan 17 '19 17:01 jat255

I'll note that the above was with PAPERLESS_OCR_THREADS=2. Setting this to 1 (to hopefully disable multithreading) without also setting OMP_THREAD_LIMIT did not help; the test took about 10 minutes again.

Jan 17 '19 17:01 jat255

Hey @jat255, is this issue "fixed" now? As in, is there anything that we can/should do to Paperless to make it work for your case without gumming up more common use-cases?

Feb 10 '19 17:02 danielquinn

Thanks for following up on this. I think it would be wise to set the thread limit environment variable by default in the configuration file, although this might anger anyone running paperless on a highly parallel system (which I am not). What do you think?

Feb 10 '19 18:02 jat255

I think both can be handled fairly easily actually.

You can get the number of CPUs available in Python with this:

import multiprocessing

multiprocessing.cpu_count()

So then you can use the output of that as the default, and use whatever the user has specified (if anything) instead with something like this in settings.py:

import multiprocessing
import os

# os.getenv returns a string when the variable is set, so convert to int
OCR_THREADS = int(os.getenv("PAPERLESS_OCR_THREADS", multiprocessing.cpu_count()))

This would have to be documented though. We can't just add stuff and let it be a surprise for other users :-)

Sound good?

Feb 10 '19 18:02 danielquinn

That's probably a good idea. The other thing that @jbarlow83 mentioned as making a difference is processor support for AVX2. My old Turion processor doesn't have that, which is probably why it's so slow. I'm not sure how to effectively test for that using Python. In Linux, you can test with lscpu and look for avx2 in the "Flags" section, but I'm not sure what the requirement would be for people running paperless on Windows or Mac/BSD.
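On Linux at least, a rough check from Python could read the same flags from /proc/cpuinfo. The sketch below is an assumption-laden illustration, not tested against Paperless: the function names are made up, it only understands the x86 "flags" line format, and it says nothing about Windows or Mac/BSD (it returns None when the file cannot be read).

```python
from pathlib import Path


def cpuinfo_has_avx2(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def linux_has_avx2(path="/proc/cpuinfo"):
    """Linux only: returns None when the file cannot be read (e.g. other OSes)."""
    try:
        return cpuinfo_has_avx2(Path(path).read_text())
    except OSError:
        return None
```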

Note that we have to do more than just set OCR_THREADS correctly, because we need to set the OpenMP environment vars for Tesseract.

I would suggest that paperless take an opinionated stance, either assuming AVX2 is unavailable or assuming it is available, and set the proper environment variables accordingly. We could add to the docs that it is up to the user to swap that flag in paperless.conf to match their system's capabilities.

I would suggest adding to paperless.conf:

# By default, Paperless will assume the processor does not support the newer AVX2
# instruction set (see https://en.wikipedia.org/wiki/Advanced_Vector_Extensions).
# Modern versions of tesseract (4.0+) will perform much faster OCR when the processor
# supports AVX2, but will perform very poorly without it (unless certain settings are changed).
# Setting this value to match your system's configuration will optimize the performance of tesseract.
# Uncomment below and set the value to "true" to allow Paperless to use the updated features.
# (To see if your processor supports AVX2 on Linux, run `lscpu | grep avx2`; if nothing is returned, AVX2
# is not supported)
# PAPERLESS_AVX2_AVAILABLE="false"

And then something like this in the setup code:

USE_AVX2 = __get_boolean("PAPERLESS_AVX2_AVAILABLE", "false")
if USE_AVX2:
    os.environ["OMP_THREAD_LIMIT"] = str(OCR_THREADS)
else:
    os.environ["OMP_THREAD_LIMIT"] = "1"

Thoughts?

Feb 10 '19 18:02 jat255

This would force tesseract to be single-threaded in OpenMP when those instructions aren't available, which I think should provide the best experience for users by default.

Feb 10 '19 19:02 jat255

I want to stress that there are two independent reasons Tesseract 4 can be slow compared to 3:

1. Running parallel instances of Tesseract 4 without setting OMP_THREAD_LIMIT=1 (this leads to too many threads fighting over CPU time)
2. Running Tesseract 4 on a processor without AVX2

I recommend that anyone running parallel instances of Tesseract just set os.environ["OMP_THREAD_LIMIT"] = "1", then use OCR_THREADS as in the past to limit the number of Tesseract processes/threads that run. Tesseract 3 doesn't care about the parameter, so there's no need to test the version. There is no benefit to using OpenMP if you are already parallelizing processes, because the opportunities for parallelism available to OpenMP are more limited than those from parallelizing whole Tesseract processes. This conclusion is the result of performance testing I did for a client that did a large scale cloud deployment of OCRmyPDF. (Footnote: I suspect some atypical workloads, such as very small or very large images, would be exceptions, but I haven't tested that. The same may apply if you are more concerned about latency than throughput. Needless to say, such atypical workloads are not the sort of thing people use paperless for. I should mention OpenMP was not a wasted effort for Tesseract; it benefits users of the command line program in particular.)
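As a rough sketch of that arrangement (not how paperless or OCRmyPDF actually implement it): each Tesseract process gets OMP_THREAD_LIMIT=1, and a pool of `workers` caps how many processes run at once. The function names here are hypothetical, and the code assumes a `tesseract` binary on PATH that accepts `stdout` as its output base.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tesseract_env(base_env=None):
    """Copy the environment and cap OpenMP at one thread per Tesseract process."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def ocr_page(image_path):
    """Run one tesseract process and return the recognized text."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        env=tesseract_env(),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")


def ocr_pages(image_paths, workers, ocr=ocr_page):
    """OCR pages with at most `workers` Tesseract processes at a time."""
    # The Python threads only wait on subprocesses; the actual parallelism
    # comes from the separate single-threaded tesseract processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr, image_paths))
```

Here `workers` would play the role of OCR_THREADS, so total CPU use is bounded by that one setting.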

There's no point to testing for the availability of AVX2 unless you want paperless to warn users about OCR performance compared to Tesseract 3. Nothing can be done if AVX2 is not available; the user just has to accept OCR will take longer than Tesseract 3.

Feb 10 '19 23:02 jbarlow83

I just ran into this issue setting up paperless with the docker-compose flow. Maybe we could add this flag to the template docker-compose.env file?

Jun 03 '20 01:06 frrad