ocrd_tesserocr
ocrd_tesserocr processors waste CPU time because of numpy BLAS threads
The current code imports numpy although it only uses a single function from that library. Importing numpy starts a number of threads for its BLAS backend by default. Those threads use a lot of CPU time without doing anything useful.
Setting the environment variable OMP_THREAD_LIMIT=1 avoids those additional threads.
Maybe there is a better solution which does not require an environment variable, for example removing the numpy requirement.
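As a sketch of the workaround described above: OpenBLAS sizes its thread pool when the library is first loaded, so the limit has to be in the environment before the first `import numpy`. The following assumes an OpenBLAS-backed numpy build; `OPENBLAS_NUM_THREADS` is the backend-specific equivalent of the OpenMP variable mentioned in the issue.

```python
import os

# Must be set *before* numpy is imported for the first time, because the
# BLAS backend reads these variables at load time.
os.environ["OMP_THREAD_LIMIT"] = "1"       # the variable mentioned in the issue
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS-specific equivalent

import numpy as np  # BLAS backend now initializes with a single worker thread

# np.round is the single numpy function the processor actually uses
print(np.round(3.7))
```

Setting the variables from inside the process only works if numpy has not been imported yet anywhere in the interpreter; otherwise they must be exported in the shell that launches the processor.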
> The current code imports numpy although it only uses a single function from that library.

I can only see `np.round` in ocrd-tesserocr-segment-region, and only under very rare circumstances.
> Including numpy creates a number of threads for the BLAS algorithms by default. Those threads use a lot of CPU time without doing anything useful.

Are you saying that a function which does not even get called most of the time consumes CPU time because of some multi-threaded library? How is that? Did you measure or bisect that?
> Setting the environment variable OMP_THREAD_LIMIT=1 avoids those additional threads.

That's what workflow-configuration does whenever you run with multiple jobs.
@bertsky, it's not the function - it's the import statement that starts the threads which burn the CPU time.
> it's not the function - it's the import statement that starts the threads which burn the CPU time.

Did you cross-check that (by deactivating the import statement and measuring again)?
(I have a hard time believing that an unused module or function can burn CPU time.)
You are right. The function is used for some pages, but even after removing the import statement and the function call, there remain 3 threads which use CPU time in my test; one of them is producing OCR. In GDB I see 6 threads (my CPU supports 6 threads), 5 of them looking like this:
```
(gdb) thr 7
[Switching to thread 7 (Thread 0x7fffe78c8700 (LWP 521057))]
#0  0x00007ffff7d54067 in sched_yield () at ../sysdeps/unix/syscall-template.S:120
120     ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) i s
#0  0x00007ffff7d54067 in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007fffefeda4f2 in blas_thread_server ()
   from /venv-20201001/lib/python3.7/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
#2  0x00007ffff7f8bea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#3  0x00007ffff7d6ceaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```
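The thread creation observed above can also be reproduced without gdb. On Linux, every OS-level thread of the current process appears as one subdirectory under /proc/self/task, so a small sketch like the following (Linux-only, assuming an OpenBLAS-backed numpy) shows the worker threads appearing at import time:

```python
import os

def native_thread_count():
    # Linux-specific: each OS thread of this process is listed as one
    # subdirectory of /proc/self/task.
    return len(os.listdir("/proc/self/task"))

before = native_thread_count()   # normally 1 for a plain interpreter
import numpy                     # an OpenBLAS build spawns its blas_thread_server workers here
after = native_thread_count()

print(f"threads before numpy import: {before}, after: {after}")
```

The exact number of extra threads depends on the BLAS build and the CPU core count, so the sketch only demonstrates the before/after difference rather than a fixed value.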
So the problem remains, but my assumption about its cause was wrong.
I now checked thread creation in gdb. Even after removing the numpy code from segment_region.py, there still remains a numpy import which starts 5 blas_thread_server threads. The process then also creates lots of short-lived other threads, obviously triggered by shapely.
During execution I see 3 threads (always the same PIDs) using the CPU. By attaching gdb to one of them I could confirm that it is a blas_thread_server thread, so the subject of this issue is correct.
@stweil, this OpenBLAS issue looks related to what you describe. But it was fixed 5 years ago, so I guess the fix is already deployed on most systems we use today. (I just learned that you need to install libatlas3-base, liblapack3 and libopenblas-base to make numpy use these backends. Not sure about our Docker images... And I don't understand numpy.__config__.show() yet.)
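For reference, `numpy.__config__.show()` prints the build configuration to stdout rather than returning it, which is why it can be confusing to inspect. A sketch for capturing its output programmatically, so you can check which BLAS/LAPACK backend a given numpy installation was linked against:

```python
import io
import contextlib
import numpy

# numpy.__config__.show() prints its report instead of returning it,
# so redirect stdout to get the text as a string.
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    numpy.__config__.show()
config = buf.getvalue()

# The report names the BLAS/LAPACK libraries numpy was built against
# (e.g. openblas), which tells you where the worker threads come from.
print(config)
```

This only reports the build-time configuration; which shared library actually gets loaded at runtime can still differ depending on what is installed on the system.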