
ocrd_tesserocr processors waste CPU time because of numpy BLAS threads

Open stweil opened this issue 5 years ago • 6 comments

The current code imports numpy although it only uses a single function from that library. Including numpy creates a number of threads for the BLAS algorithms by default. Those threads use a lot of CPU time without doing anything useful.

Setting the environment variable OMP_THREAD_LIMIT=1 avoids those additional threads.

Maybe there exists a better solution which does not require an environment variable, for example removing the numpy requirement.
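One alternative that avoids depending on the caller's shell is to set the variable from inside the process, before numpy is first imported. A minimal sketch (`OMP_THREAD_LIMIT` is the variable from this issue; `OPENBLAS_NUM_THREADS` is the OpenBLAS-specific equivalent; the placement before the import is what matters, since OpenBLAS sizes its worker pool once, at library load time):

```python
import os

# These assignments must happen before the first `import numpy` anywhere
# in the process: OpenBLAS creates its thread pool when the shared
# library is loaded and does not re-read the environment afterwards.
os.environ["OMP_THREAD_LIMIT"] = "1"       # cap OpenMP threads
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # cap the OpenBLAS worker pool

import numpy  # noqa: E402  -- now loads with a single-threaded BLAS
```

Whether this belongs in the processors themselves or in the calling workflow is a design question; the sketch only shows the mechanism.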

stweil avatar Oct 03 '20 21:10 stweil

The current code imports numpy although it only uses a single function from that library.

I can only see np.round in ocrd-tesserocr-segment-region, and only under very rare circumstances.

Including numpy creates a number of threads for the BLAS algorithms by default. Those threads use a lot of CPU time without doing anything useful.

Are you saying a function that does not even get called most of the time is consuming CPU time because of some multi-threaded library? How is that? Did you measure or bisect that?

Setting the environment variable OMP_THREAD_LIMIT=1 avoids those additional threads.

That's what workflow-configuration is doing whenever you run with multiple jobs.

bertsky avatar Oct 04 '20 20:10 bertsky

@bertsky, it's not the function - it's the import statement which starts the threads which burn the CPU time.

stweil avatar Oct 04 '20 20:10 stweil

it's not the function - it's the import statement which starts the threads which burn the CPU time.

Did you cross-check that (deactivating the import statement and measuring again)?

(I have a hard time believing an unused module/function can burn CPU time.)

bertsky avatar Oct 04 '20 20:10 bertsky

You are right. The function is used for some pages, but even after removing the import statement and the function call there remain 3 threads which use CPU time in my test. One is producing OCR. In GDB I see 6 threads (my CPU supports 6 threads), 5 of them looking like this:

(gdb) thr 7
[Switching to thread 7 (Thread 0x7fffe78c8700 (LWP 521057))]
#0  0x00007ffff7d54067 in sched_yield () at ../sysdeps/unix/syscall-template.S:120
120	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) i s
#0  0x00007ffff7d54067 in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007fffefeda4f2 in blas_thread_server ()
   from /venv-20201001/lib/python3.7/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#2  0x00007ffff7f8bea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#3  0x00007ffff7d6ceaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

So the problem remains, but my assumption about what might be causing it was wrong.

stweil avatar Oct 05 '20 05:10 stweil

I have now checked thread creation in gdb. Even after removing the numpy code from segment_region.py, numpy still gets imported (indirectly) and starts 5 blas_thread_server threads. The process then also creates lots of short-lived other threads, obviously triggered by shapely.

During execution I see 3 threads (always the same PIDs) using the CPU. By attaching gdb to one of them I could confirm that it is a blas_thread_server thread, so the subject of this issue is correct.
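For reference, the thread count can also be checked from inside the process without attaching gdb, by counting the entries under `/proc/self/task` (a sketch; this is Linux-specific):

```python
import os

def thread_count() -> int:
    """Number of kernel threads in this process (Linux only):
    each thread appears as one directory under /proc/self/task."""
    return len(os.listdir("/proc/self/task"))

print(thread_count())  # before importing numpy: typically 1
import numpy           # OpenBLAS may spawn its worker threads here
print(thread_count())  # often grows by roughly the number of cores
```

Comparing the count before and after the import makes it easy to attribute the extra threads to a specific module.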

stweil avatar Oct 05 '20 05:10 stweil

@stweil this OpenBLAS issue looks related to what you describe. But it was fixed five years ago, so I guess the fix is already deployed in most systems we use today. (I just learned you need to install libatlas3-base, liblapack3 and libopenblas-base to make numpy use these backends. Not sure about our Docker images... And I don't understand numpy.__config__.show() yet.)
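For what it's worth, `numpy.__config__.show()` (also exposed as the public `numpy.show_config()`) just prints which BLAS/LAPACK build numpy was compiled or linked against, which is how one can tell whether OpenBLAS (and hence its thread pool) is in play:

```python
import numpy as np

# Prints the BLAS/LAPACK backends this numpy build is linked against
# (openblas, atlas, mkl, ...), including library directories.
np.show_config()
```

If the output mentions openblas, the blas_thread_server threads seen in gdb above are expected.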

bertsky avatar Feb 12 '21 18:02 bertsky