
Hanging on Random Files

jgforbes opened this issue 2 years ago • 23 comments

Running find . -printf '%p' -maxdepth 1 -iname "*.pdf" -exec ocrmypdf -v1 '{}' '{}' \; on a directory with many PDF files, most of which already have OCR data, ocrmypdf hangs on random files. Hitting Ctrl+C gives the traceback shown below.
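(Editor's note: a minimal Python sketch of the same batch run, using the documented ocrmypdf.ocr API; the in-place input/output mirrors the '{}' '{}' usage above, and PriorOcrFoundError is the exception that appears in the logs later in this thread. It does not replicate -iname's case-insensitivity.)

# Hedged sketch, not from the original report: batch-OCR every PDF in the
# current directory via the Python API instead of find -exec.
from pathlib import Path

import ocrmypdf
from ocrmypdf.exceptions import PriorOcrFoundError

for pdf in sorted(Path(".").glob("*.pdf")):  # non-recursive, like -maxdepth 1
    try:
        ocrmypdf.ocr(pdf, pdf)  # same file for input and output, like '{}' '{}'
    except PriorOcrFoundError:
        print(f"{pdf}: already has text, skipping")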

./test file.pdf
ocrmypdf 11.5.0
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (1):
eng

Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['tesseract', '-l', 'eng', '--print-parameters', 'pdf']
Running: ['gs', '--version']
Found gs 9.53.3
pikepdf mmap disabled
os.symlink(./test file.pdf, /tmp/ocrmypdf.io.n_8kmlq6/origin)
os.symlink(/tmp/ocrmypdf.io.n_8kmlq6/origin, /tmp/ocrmypdf.io.n_8kmlq6/origin.pdf)
pikepdf mmap disabled
pikepdf mmap disabled
pikepdf mmap disabled
pikepdf mmap disabled
Scanning contents: 100%|████████████████████████████████████████████████████████████| 716/716 [00:20<00:00, 170.00page/s]
^C
Scanning contents: 100%|████████████████████████████████████████████████████████████| 716/716 [21:11<00:00, 1.78s/page]
[jgf@bigdog testDir]$
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 359, in run_pipeline
    pdfinfo = get_pdfinfo(
  File "/usr/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 155, in get_pdfinfo
    return PdfInfo(
  File "/usr/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 833, in __init__
    self._pages = _pdf_pageinfo_concurrent(
  File "/usr/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 619, in _pdf_pageinfo_concurrent
    exec_progress_pool(
  File "/usr/lib/python3.9/site-packages/ocrmypdf/_concurrent.py", line 135, in exec_progress_pool
    pool.join()
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 666, in join
    p.join()
  File "/usr/lib64/python3.9/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/usr/lib64/python3.9/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib64/python3.9/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

Describe the bug
It seems that when the files are on the NVMe M.2 drive, the testing/conversion stops at random files. When they are on the spinning disk, all of the files are analysed. Could it be a race condition with the high speed of the NVMe M.2 storage?

Expected behavior
All files will be completed.

System

  • OS: Fedora 33 with spinning disk, SSD, network, and NVMe storage
  • OCRmyPDF Version: 11.5.0
  • Installed via rpm package

jgforbes avatar Aug 03 '21 04:08 jgforbes

There is a deadlock that affected the v11 releases. (Logging + multiprocessing related, it turns out.)

Chances are, upgrading to the latest release will fix it. Please try that before we go any further.
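(Editor's note: a minimal sketch, assuming nothing about OCRmyPDF's internals, of how a logging + multiprocessing combination can deadlock under the POSIX "fork" start method. Logging handlers guard output with a lock; if os.fork() lands while another thread holds that lock, the child inherits a permanently locked lock.)

# Hedged, POSIX-only demonstration of the general hazard (not OCRmyPDF code).
# The race is probabilistic, not guaranteed.
import logging
import multiprocessing
import threading

def chatty():
    while True:
        logging.getLogger("noise").info("spam")  # repeatedly acquires the handler lock

def child_task():
    logging.getLogger("child").info("hello from the child")  # may block forever

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    threading.Thread(target=chatty, daemon=True).start()
    ctx = multiprocessing.get_context("fork")  # the hazardous start method
    p = ctx.Process(target=child_task)
    p.start()
    p.join()  # can hang if the fork happened while the lock was held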

jbarlow83 avatar Aug 03 '21 07:08 jbarlow83

Version 12.3.1 works correctly with the NVMe storage. The verbose -v1 mode shows a traceback when the PriorOcrFoundError is emitted, but it does not hang. It also works consistently with the network storage.

jgforbes avatar Aug 03 '21 13:08 jgforbes

Well, the new version does seem to have some hangs, but much less frequently. Here is a traceback after hitting Ctrl+C. Not much there. I will see if I can get a hang with -v1.

PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
^CException ignored in: <module 'threading' from '/usr/lib64/python3.9/threading.py'>
Traceback (most recent call last):
  File "/usr/lib64/python3.9/threading.py", line 1448, in _shutdown
    lock.acquire()
KeyboardInterrupt:

jgforbes avatar Aug 04 '21 02:08 jgforbes

"Test File" was not the original name; the original was a very long path. I will see if it stops at other files.

./Test File.pdf
ocrmypdf 12.3.1
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (1):
eng                                                                                                                                                                              

Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
os.symlink(./Test File.pdf, /tmp/ocrmypdf.io.kxy_x8ii/origin)                                      
os.symlink(/tmp/ocrmypdf.io.kxy_x8ii/origin, /tmp/ocrmypdf.io.kxy_x8ii/origin.pdf)
Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 975/975 [00:02<00:00, 359.16page/s]
Using Tesseract OpenMP thread limit 1
Start processing 4 pages concurrently
    6 Rasterize with pngmono, rotation 0                                                                                                                                         
OCR:   0%|                                                                                                                                         | 0.0/975.0 [00:00<?, ?page/s]
ExitCodeException
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 172, in exec_page_sync
    if not is_ocr_required(page_context):
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_pipeline.py", line 271, in is_ocr_required
    raise PriorOcrFoundError(
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 374, in run_pipeline
    exec_concurrent(context, executor)
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 271, in exec_concurrent
    executor(
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_concurrent.py", line 82, in __call__
    self._execute(
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 132, in _execute
    for result in results:
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr

jgforbes avatar Aug 04 '21 02:08 jgforbes

ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr

This exception is expected for files that already have OCR.

jbarlow83 avatar Aug 04 '21 03:08 jbarlow83

Yes, but the processing of the files stopped after that exception. Since the ocrmypdf command was in the -exec of a find statement, Ctrl+C stopped all processing. I am now using a script instead, and it has hung on at least one file so far. After waiting some minutes, Ctrl+C resumes the processing. Maybe this is a file content issue and not a multiprocessing issue. I will try to send more information.

jgforbes avatar Aug 04 '21 13:08 jgforbes

Below is an example of ocrmypdf getting stuck on a file that had prior OCR text. I hit Ctrl+C twice to stop it, and the script continued. It had been stuck for 18 minutes.

ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr

real    0m2.391s
user    0m4.369s
sys     0m0.367s

File #436 : ./Test File 436.pdf
ocrmypdf 12.3.1
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (1):
eng                                                                                                                                                                              

Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
os.symlink(./Test File 436.pdf, /tmp/ocrmypdf.io.4h6quh69/origin)                                                                                                                                     
os.symlink(/tmp/ocrmypdf.io.4h6quh69/origin, /tmp/ocrmypdf.io.4h6quh69/origin.pdf)
Scanning contents: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 218/218 [00:02<00:00, 91.71page/s]
Using Tesseract OpenMP thread limit 1
Start processing 4 pages concurrently
    1 Rasterize with png16m, rotation 0                                                                                                                                          
OCR:   0%|                                                                                                                                         | 0.0/218.0 [00:00<?, ?page/s]
ExitCodeException
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 172, in exec_page_sync
    if not is_ocr_required(page_context):
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_pipeline.py", line 271, in is_ocr_required
    raise PriorOcrFoundError(
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 374, in run_pipeline
    exec_concurrent(context, executor)
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 271, in exec_concurrent
    executor(
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_concurrent.py", line 82, in __call__
    self._execute(
  File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 132, in _execute
    for result in results:
  File "/usr/lib64/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;  see also help for the arguments --skip-text and --redo-ocr
^CException ignored in: <module 'threading' from '/usr/lib64/python3.9/threading.py'>
Traceback (most recent call last):
  File "/usr/lib64/python3.9/threading.py", line 1448, in _shutdown
    lock.acquire()
KeyboardInterrupt: 

^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/usr/lib64/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib64/python3.9/multiprocessing/queues.py", line 201, in _finalize_join
    thread.join()
  File "/usr/lib64/python3.9/threading.py", line 1053, in join
    self._wait_for_tstate_lock()
  File "/usr/lib64/python3.9/threading.py", line 1069, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

real    18m53.022s
user    0m10.039s
sys     0m0.349s

jgforbes avatar Aug 04 '21 15:08 jgforbes

@jgforbes I think I've run into the same issue. Weirdly, I got it to run after canceling the command and then changing any of the command-line arguments (adding or removing a language from -l, for example). Did you observe that too, or can you check?

stargazing-dino avatar Aug 04 '21 21:08 stargazing-dino

It only seems to happen when processing many files, and if they are reprocessed, it happens on different files. If I run ocrmypdf on the files where it stopped previously, it completes. The presence or absence of command-line switches does not seem to have an effect. It also happens when the input files are on a spinning disk, so storage speed is not the issue. I am running the latest version from a Python virtual environment. Could that have an effect? Changing the command-line switches as you suggest still results in hung processes.

jgforbes avatar Aug 05 '21 17:08 jgforbes

A few things to help with debugging/reproducing if either of you are able:

  1. Use gdb to inspect the deadlock

In Fedora try: sudo dnf install gdb python-debuginfo

Then wait for the process to hang. Attach to the process using gdb -p, and use py-bt to inspect the backtraces of the various worker processes. It may be helpful to try -j 2 or -j 4 to reduce the number of workers. We're looking for workers stuck in interesting places, like trying to acquire a lock. An IDE may be able to do this as well.

See https://devguide.python.org/gdb/

  2. Try the semaphore-free executor

ocrmypdf --plugin ocrmypdf.extra_plugins.semfree input.pdf output.pdf

This is less efficient but is used in AWS Lambda and Android Termux, which don't have standard POSIX semaphores. (A Python API sketch of options 2 and 3 follows the list.)

  3. Try threads instead of processes

ocrmypdf --use-threads ...
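(Editor's note: if you drive ocrmypdf from Python rather than the shell, a rough equivalent of options 2 and 3 via the documented ocrmypdf.ocr API might look like the following. The plugins keyword is documented; use_threads as a keyword is an assumption inferred from the --use-threads flag and may differ by version.)

import ocrmypdf

# Option 2: semaphore-free executor via the plugins keyword (documented API).
ocrmypdf.ocr("input.pdf", "output.pdf",
             plugins=["ocrmypdf.extra_plugins.semfree"])

# Option 3: threads instead of processes; keyword name assumed from the
# --use-threads CLI flag.
ocrmypdf.ocr("input.pdf", "output2.pdf", use_threads=True)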

jbarlow83 avatar Aug 06 '21 08:08 jbarlow83

Running the containerized version on a different server on the same dataset still hangs, but with much less frequency.

I have performed the test with the semaphore-free executor, and it completed the data set without hanging. Qualitatively, the average time per file seems much lower, but some files took minutes to report that there was already OCR text.

I will try the debugger to see what I can find.

jgforbes avatar Aug 06 '21 15:08 jgforbes

There does not seem to be a python-debuginfo package for Fedora 33. The devguide link says, "When CPython is built you will notice a python-gdb.py file in the root directory of your checkout." The python-gdb.py file is needed to use gdb with Python.

jgforbes avatar Aug 06 '21 17:08 jgforbes

OK, I got the Python 3.9 debugging information. There is only one hung process:

Attaching to process 163714
[New LWP 163720]
[New LWP 163780]
[New LWP 163794]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f135045c0d4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-4.fc33.x86_64 expat-2.4.1-1.fc33.x86_64 fribidi-devel-1.0.10-3.fc33.x86_64 giflib-5.2.1-6.fc33.x86_64 glibc-2.32-10.fc33.x86_64 jbigkit-libs-2.1-19.fc33.x86_64 leptonica-1.81.1-1.fc33.x86_64 libgcc-10.3.1-1.fc33.x86_64 libjpeg-turbo-2.0.5-5.fc33.x86_64 libpng-1.6.37-6.fc33.x86_64 libstdc++-10.3.1-1.fc33.x86_64 libtiff-4.1.0-8.fc33.x86_64 libuuid-2.36.1-1.fc33.x86_64 libwebp-1.2.0-1.fc33.x86_64 libzstd-1.5.0-1.fc33.x86_64 openssl-libs-1.1.1k-1.fc33.x86_64 xz-libs-5.2.5-4.fc33.x86_64 zlib-1.2.11-23.fc33.x86_64
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 1448, in _shutdown
    lock.acquire()
(gdb) 

This is the same traceback shown when Ctrl+C is pressed.

jgforbes avatar Aug 06 '21 18:08 jgforbes

I think this happens if the Linux OOM killer kills one of the worker child processes.
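(Editor's note: a hedged simulation of that theory, not OCRmyPDF code: if a multiprocessing.Pool worker is SIGKILLed, a stand-in for the OOM killer, the parent can wait forever, because the dead worker's task is never marked complete.)

import os
import signal
from multiprocessing import Pool

def task(i):
    if i == 0:
        os.kill(os.getpid(), signal.SIGKILL)  # stand-in for the OOM killer
    return i

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        # On the Python versions discussed in this thread, this call may
        # never return: the pool respawns the dead worker, but the lost
        # task is never completed.
        print(pool.map(task, range(4)))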

jbarlow83 avatar Oct 28 '21 09:10 jbarlow83

This is probably/hopefully resolved as of v13.1. ocrmypdf now recognizes if one of its child workers has crashed or been killed by the OOM killer. Previously, this kind of error handling only worked if a Python exception was thrown, so certain errors would be missed.
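(Editor's note: for illustration of the general class of fix only, not OCRmyPDF's actual implementation: concurrent.futures surfaces a dead worker as BrokenProcessPool instead of hanging, so the parent can fail loudly.)

import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def die(_):
    os.kill(os.getpid(), signal.SIGKILL)  # simulate a crash or OOM kill

if __name__ == "__main__":
    try:
        with ProcessPoolExecutor(max_workers=2) as executor:
            list(executor.map(die, range(2)))
    except BrokenProcessPool:
        print("a worker died unexpectedly; failing loudly instead of hanging")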

jbarlow83 avatar Dec 19 '21 23:12 jbarlow83

I am writing to report that the problem described by OP has not been fixed in OCRmyPDF 13.3.0 (with Python 3.9.10). However, I also want to note that the problem can be avoided with the "--use-threads" argument, as suggested by @jbarlow83 on Aug 6, 2021. (See also: https://github.com/ocrmypdf/OCRmyPDF/issues/800) I think it would be great if the problem could be fixed within OCRmyPDF. However, if fixing this problem is not a priority, I think the "--use-threads" argument should be added to the documentation. The argument is not currently mentioned there, and it is not ideal that the workaround appears only in comments on issue reports.

More specifically, the problem comes up when the argument "--skip-text" or "--redo-ocr" is used on a file that has prior OCR text; OCRmyPDF hangs on the OCR step in either case. It does not show up for a PDF file made entirely of scanned images with no prior OCR. It makes no sense that OCRmyPDF should hang on the OCR step when it says it is going to "skip all processing" on every single page with the "--skip-text" argument, as seen in the following example, but it does.

$ ocrmypdf --skip-text good_PDF_2.pdf output.pdf
Scanning contents: 100%|████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 180.08page/s]
Start processing 15 pages concurrently
    1 skipping all processing on this page
    2 skipping all processing on this page
    3 skipping all processing on this page
    4 skipping all processing on this page
    5 skipping all processing on this page
    6 skipping all processing on this page
    8 skipping all processing on this page
    7 skipping all processing on this page
   10 skipping all processing on this page
    9 skipping all processing on this page
   12 skipping all processing on this page
   11 skipping all processing on this page
   13 skipping all processing on this page
   14 skipping all processing on this page
   15 skipping all processing on this page
OCR: 100%|████████████████████████████████████████████████████████████| 15.0/15.0 [00:00<00:00, 249.21page/s]
OCR: 100%|████████████████████████████████████████████████████████████| 15.0/15.0 [00:03<00:00, 4.63page/s]
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/----/miniconda3/envs/-----/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

But, as already stated, the "--use-threads" argument is a workaround for the issue.

$ ocrmypdf --skip-text --use-threads good_PDF_2.pdf output.pdf
Scanning contents: 100%|████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 92.11page/s]
Start processing 15 pages concurrently
    1 skipping all processing on this page
    2 skipping all processing on this page
    3 skipping all processing on this page
    4 skipping all processing on this page
    5 skipping all processing on this page
    6 skipping all processing on this page
    7 skipping all processing on this page
    8 skipping all processing on this page
    9 skipping all processing on this page
   10 skipping all processing on this page
   11 skipping all processing on this page
   12 skipping all processing on this page
   13 skipping all processing on this page
   14 skipping all processing on this page
   15 skipping all processing on this page
OCR: 100%|████████████████████████████████████████████████████████████| 15.0/15.0 [00:00<00:00, 1121.17page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 32.30page/s]
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)

fj7778 avatar Feb 07 '22 21:02 fj7778

However, if fixing this problem is not a priority,

I don't have a way to reproduce this, so it's not easy to investigate; not that this sort of issue is easy to investigate in the first place. Python 3.10 may be worth a shot, if available, in case it's a Python standard library issue.

--use-threads gives much slower performance.

jbarlow83 avatar Feb 07 '22 22:02 jbarlow83

@jbarlow83 Thanks! I will give Python 3.10 a try at a later time. It appears that OP, the poster of https://github.com/ocrmypdf/OCRmyPDF/issues/800, and I all use virtual environments. Maybe this is all a virtual environment + multiprocessing problem, or possibly even something specific to Anaconda/Miniconda?

Anyway, slower is better than nonfunctional. It may be worth mentioning --use-threads in the documentation as a potential troubleshooting setting for when OCRmyPDF is used on a range of PDF files whose scanned/"born digital" characteristics are not known ahead of time. That way, users will not question their sanity, as I did when OCRmyPDF worked for some files but not others.

For myself, OCRmyPDF is worth using even if I have to use it at a slower speed. I hope this problem can be fixed in time.

fj7778 avatar Feb 08 '22 03:02 fj7778

Just ran into this bug w/ ocrmypdf 13.4.0. Happens in both Python 3.9 and 3.10. I have 12 cores on my machine; ocrmypdf completes successfully w/ -j 1 up to 7 or 8, but fails w/ -j 8 or 9 or more. (It also completes successfully if I add --use-threads, even w/, say, -j 12.)

The offending PDF is a 31-page federal court opinion. Possibly related: the file apparently already has OCR text, but search didn't work, hence my need to use ocrmypdf.

$ ocrmypdf federal.court.opinion.pdf federal.court.opinion2.pdf --redo-ocr -j 12

Scanning contents: 100%|| 31/31 [00:00<00:00, 72.11page/s]
Start processing 9 pages concurrently
    1 redoing OCR                                                                                             
    2 redoing OCR                                                                                             
...
   31 redoing OCR
OCR:  98%|| 30.5/31.0 [00:28<00:00,  1.64page/s]
OCR: 100%|| 31.0/31.0 [06:26<00:00, 12.48s/page]
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/usr/lib/python3.10/threading.py'>
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1525, in _shutdown
    atexit_call()
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 95, in _python_exit
    t.join()
  File "/usr/lib/python3.10/threading.py", line 1089, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.10/threading.py", line 1105, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt: 
^CException ignored in atexit callback: <function _exit_function at 0x7fe59433b9a0>
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt:

edhalter avatar Feb 26 '22 04:02 edhalter

@edhalter Thank you for posting. I was the previous poster and would like this problem fixed. I am glad you added to this thread.

My earlier post was created using a computing cluster. I do not know how many cores there are on the cluster except that there are a lot of them.

I tested my original 15 page PDF on my cluster:

With no "-j" setting, it processes 15 pages concurrently but hangs. With no "-j" settings and "--use-threads", all 15 pages are processed concurrently and successfully. With "-j X" settings, X pages are processed concurrently. It succeeds with X <= 5, otherwise, it hangs.

Using your 31 page PDF file on my cluster:

With no "-j" setting, it processes 31 pages concurrently but hangs. With no "-j" settings and "--use-threads", all 31 pages are processed concurrently and successfully. With "-j X" settings, X pages are processed concurrently. It succeeds with X <= 5, otherwise, it hangs.

The results are essentially identical using my file and your file.

With my cluster I am able to process more pages concurrently than you can, but my maximum successful "-j" setting is lower than yours: 5 instead of 8.

So I can confirm that throttling with the "-j" flag can avoid the hang. However, this is clearly not a fix for the problem.

fj7778 avatar Feb 26 '22 15:02 fj7778

First, v13.1 does fix at least some versions of this issue, since a failed child process (which can be simulated easily) would definitely hang ocrmypdf, and that is fixed now. What users have reported since is probably something different, but the behavior looks the same to a user, so we'll deal with it here.

@edhalter 's file seems to be a consistent reproducer of the issue post-13.1 which I very much needed.

The stack trace everyone is posting seems to be https://bugs.python.org/issue45274, for which there are partial fixes that have not reached many people's CPython builds yet. That stack trace is also likely unrelated to the issue. Rather, when the issue occurs, a side effect is that Ctrl+C cannot be processed normally, and Python issue 45274 occurs as well. It's collateral damage that gives no clue as to what is actually happening. It also means a debugger cannot step in normally, since debuggers send SIGINT, which is the same as pressing Ctrl+C. In short, this has been difficult to explore.

jbarlow83 avatar Feb 27 '22 04:02 jbarlow83

Tests:

Explored the possibility that tls.pageno - used to ensure page numbers are logged correctly in error messages - is involved, since it involves thread-local storage. Issue still occurs.

listener.join(timeout=5) - Issue still occurs.

--no-progress-bar - Issue still occurs. But seems less frequent.

-q - Issue not reproducible in 10 attempts.

Bypass logger.handle(record) (suppress all log messages sent from child processes) - not reproducible

Replace log_listener with logging.handlers.QueueListener - Issue seems to occur almost 100% of the time. (The stdlib QueueListener pattern is sketched at the end of this comment.)

Remove with pool_lock - still occurs

Ensure listener.stop called after executor.__exit__ with try/finally - still occurs (probably keep this change, seems better)

Put sentinel before executor.__exit__ - causes BrokenPipeError from somewhere

Explicit shutdown before executor.__exit__ - defers deadlock to the end

Getting pretty deep into Python internals.....
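(Editor's note: for readers following along, a minimal sketch of the stdlib pattern named above, hedged, and not OCRmyPDF's actual log_listener: a QueueHandler in each worker feeds a multiprocessing.Queue, and a single QueueListener thread in the parent forwards records to the real handlers.)

import logging
import logging.handlers
import multiprocessing

def worker(q, n):
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(q))  # records go to the queue
    root.setLevel(logging.INFO)
    root.info("message from worker %d", n)

if __name__ == "__main__":
    q = multiprocessing.Queue()
    listener = logging.handlers.QueueListener(q, logging.StreamHandler())
    listener.start()  # a thread in the parent drains the queue
    procs = [multiprocessing.Process(target=worker, args=(q, n)) for n in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    listener.stop()  # flush remaining records before exit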

jbarlow83 avatar Feb 27 '22 22:02 jbarlow83

Note: -j 12 aggravates the issue; it seems related to a worker process ending up without any work to do.

Try TqdmConsole lock around write, flush - deadlocks

Try with logging_redirect_tqdm(): result = run_pipeline(...) and disable TqdmConsole, letting tqdm API handle messaging - deadlocks

Try only attempt listener.join() if executor raises no exception - deadlocks

Try q.get(timeout=2) - deadlocks

Try q.get(timeout=2) and break on empty queue - deadlocks (empty queue condition occurs repeatedly)

jbarlow83 avatar Mar 13 '22 08:03 jbarlow83

I believe this is an instance of a very general multiprocessing issue in Python: forking a process that has created threads can deadlock. OCRmyPDF now uses forkserver to avoid this situation entirely, and prefers threads by default, so the issue should be resolved.
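(Editor's note: a minimal sketch of that mitigation, assuming nothing about OCRmyPDF's internals: with the forkserver start method, children are forked from a dedicated, single-threaded server process, so they can never inherit a lock that a parent thread happened to hold at fork time.)

import logging
import multiprocessing as mp
import threading

def task(i):
    logging.getLogger("worker").warning("processing page %d", i)
    return i

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # A background logging thread like this is what makes plain "fork"
    # hazardous (see the sketch earlier in the thread).
    threading.Thread(
        target=lambda: logging.getLogger("bg").info("background noise"),
        daemon=True,
    ).start()
    ctx = mp.get_context("forkserver")  # children start from a thread-free helper
    with ctx.Pool(processes=4) as pool:
        print(pool.map(task, range(8)))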

jbarlow83 avatar Nov 12 '23 08:11 jbarlow83