OCRmyPDF
Hanging on Random Files
Running find . -printf '%p' -maxdepth 1 -iname "*.pdf" -exec ocrmypdf -v1 '{}' '{}' \; on a directory with many PDF files, most of which already have OCR data, ocrmypdf hangs on random files. Hitting ctrl-c gives the traceback shown below.
./test file.pdf
ocrmypdf 11.5.0
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (1):
eng
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['tesseract', '-l', 'eng', '--print-parameters', 'pdf']
Running: ['gs', '--version']
Found gs 9.53.3
pikepdf mmap disabled
os.symlink(./test file.pdf, /tmp/ocrmypdf.io.n_8kmlq6/origin)
os.symlink(/tmp/ocrmypdf.io.n_8kmlq6/origin, /tmp/ocrmypdf.io.n_8kmlq6/origin.pdf)
pikepdf mmap disabled
pikepdf mmap disabled
pikepdf mmap disabled
pikepdf mmap disabled
Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 716/716 [00:20<00:00, 170.00page/s] ^C
Scanning contents: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 716/716 [21:11<00:00, 1.78s/page]
[jgf@bigdog testDir]$ KeyboardInterrupt
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 359, in run_pipeline
pdfinfo = get_pdfinfo(
File "/usr/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 155, in get_pdfinfo
return PdfInfo(
File "/usr/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 833, in init
self._pages = _pdf_pageinfo_concurrent(
File "/usr/lib/python3.9/site-packages/ocrmypdf/pdfinfo/info.py", line 619, in _pdf_pageinfo_concurrent
exec_progress_pool(
File "/usr/lib/python3.9/site-packages/ocrmypdf/_concurrent.py", line 135, in exec_progress_pool
pool.join()
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 666, in join
p.join()
File "/usr/lib64/python3.9/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/usr/lib64/python3.9/multiprocessing/popen_fork.py", line 43, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/usr/lib64/python3.9/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
Describe the bug: It seems that when the files are on the NVM M.2, the testing/conversion stops at random files. When they are on the spinning disk, all of the files are analysed. Could it be a race condition with the high speed of the NVM M.2 storage?
Expected behavior: All files will be completed.
System
- OS: Fedora 33 with Spinning disk, ssd, network and nvm storage
- OCRmyPDF Version: 11.5.0
- Installed via rpm package
There is a deadlock that affected the v11 releases. (Logging + multiprocessing related, it turns out.)
Chances are, upgrading to the latest release will fix it. Please try that before we go any further.
Version 12.3.1 works correctly with the NVM storage. The verbose -v1 mode shows a traceback when the PriorOcrFoundError is emitted, but it does not hang. It also works consistently with the network storage.
Well the new version does seem to have some hangs, but much less frequently. Here is a traceback after hitting ctrl-c. Not much there. I will see if I can get a hang with -v1.
PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
^CException ignored in: <module 'threading' from '/usr/lib64/python3.9/threading.py'>
Traceback (most recent call last):
File "/usr/lib64/python3.9/threading.py", line 1448, in _shutdown
lock.acquire()
KeyboardInterrupt:
Test File was not the original name; it was a very long path. I will see if it stops at other files.
./Test File.pdf
ocrmypdf 12.3.1
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (1):
eng
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
os.symlink(./Test File.pdf, /tmp/ocrmypdf.io.kxy_x8ii/origin)
os.symlink(/tmp/ocrmypdf.io.kxy_x8ii/origin, /tmp/ocrmypdf.io.kxy_x8ii/origin.pdf)
Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 975/975 [00:02<00:00, 359.16page/s]
Using Tesseract OpenMP thread limit 1
Start processing 4 pages concurrently
6 Rasterize with pngmono, rotation 0
OCR: 0%| | 0.0/975.0 [00:00<?, ?page/s]
ExitCodeException
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 172, in exec_page_sync
if not is_ocr_required(page_context):
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_pipeline.py", line 271, in is_ocr_required
raise PriorOcrFoundError(
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 374, in run_pipeline
exec_concurrent(context, executor)
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 271, in exec_concurrent
executor(
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_concurrent.py", line 82, in __call__
self._execute(
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 132, in _execute
for result in results:
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 870, in next
raise value
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
This exception is expected for files that already have OCR.
Yes, but the processing of the files stopped after that exception. Since the ocrmypdf command was in the -exec of a find statement, ctrl-c stopped the processing. I am now using a script and it has stopped on at least one file so far. After waiting some minutes, ctrl-c restarts the processing. Maybe this is a file content issue and not a multiprocessing issue. I will try to send more information.
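For reference, here is a rough Python sketch of the kind of batch loop involved. This is not the actual script used above; it assumes OCRmyPDF's Python API (ocrmypdf.ocr) raises PriorOcrFoundError and ExitCodeException the same way the CLI reports them, which may vary by version.

```python
# Hypothetical batch loop; assumes ocrmypdf.ocr() raises PriorOcrFoundError
# for already-OCRed files, mirroring the CLI messages shown in this thread.
from pathlib import Path

import ocrmypdf
from ocrmypdf.exceptions import ExitCodeException, PriorOcrFoundError

for pdf in sorted(Path(".").glob("*.pdf")):
    try:
        ocrmypdf.ocr(pdf, pdf)  # overwrite in place, like `-exec ocrmypdf '{}' '{}'`
    except PriorOcrFoundError:
        print(f"{pdf}: already has text, skipping")
    except ExitCodeException as exc:
        print(f"{pdf}: failed: {exc}")
```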
Below is an example of ocrmypdf getting stuck on a file that had prior ocr text. I hit ctrl-c twice to get it stopped and the script continued. It had been stuck for 18 min.
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
real 0m2.391s
user 0m4.369s
sys 0m0.367s
File #436 : ./Test File 436.pdf
ocrmypdf 12.3.1
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (1):
eng
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
os.symlink(./Test File 436.pdf, /tmp/ocrmypdf.io.4h6quh69/origin)
os.symlink(/tmp/ocrmypdf.io.4h6quh69/origin, /tmp/ocrmypdf.io.4h6quh69/origin.pdf)
Scanning contents: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 218/218 [00:02<00:00, 91.71page/s]
Using Tesseract OpenMP thread limit 1
Start processing 4 pages concurrently
1 Rasterize with png16m, rotation 0
OCR: 0%| | 0.0/218.0 [00:00<?, ?page/s]
ExitCodeException
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 172, in exec_page_sync
if not is_ocr_required(page_context):
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_pipeline.py", line 271, in is_ocr_required
raise PriorOcrFoundError(
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 374, in run_pipeline
exec_concurrent(context, executor)
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_sync.py", line 271, in exec_concurrent
executor(
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/_concurrent.py", line 82, in __call__
self._execute(
File "/opt/DATA1/git/OCRmyPDF/lib64/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 132, in _execute
for result in results:
File "/usr/lib64/python3.9/multiprocessing/pool.py", line 870, in next
raise value
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
^CException ignored in: <module 'threading' from '/usr/lib64/python3.9/threading.py'>
Traceback (most recent call last):
File "/usr/lib64/python3.9/threading.py", line 1448, in _shutdown
lock.acquire()
KeyboardInterrupt:
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib64/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
finalizer()
File "/usr/lib64/python3.9/multiprocessing/util.py", line 224, in __call__
res = self._callback(*self._args, **self._kwargs)
File "/usr/lib64/python3.9/multiprocessing/queues.py", line 201, in _finalize_join
thread.join()
File "/usr/lib64/python3.9/threading.py", line 1053, in join
self._wait_for_tstate_lock()
File "/usr/lib64/python3.9/threading.py", line 1069, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt
real 18m53.022s
user 0m10.039s
sys 0m0.349s
@jgforbes I think I've run into the same issue. Weirdly, I got it to run after canceling the command and then changing any of the command-line arguments (adding or removing a lang from -l, for example). Did you observe that too, or can you check?
It only seems to happen when processing many files, and if they are reprocessed, it happens on different files. If I run ocrmypdf on the files where it stopped previously, it completes. Having a command-line switch or not does not seem to have an effect. It also happens when the input files are on a spinning disk, so the storage speed is not the issue. I am running the latest version from a Python virtual environment. Could that have an effect? Changing the command-line switches as you suggest still results in hung processes.
A few things to help with debugging/reproducing if either of you are able:
- Use gdb to inspect the deadlock. In Fedora try: sudo dnf install gdb python-debuginfo. Then wait for the process to hang. Attach to the process using gdb -p, and use py-bt to inspect the backtraces of the various worker processes. It may be helpful to try -j 2 or -j 4 to reduce the number of workers. We're looking for workers stuck in interesting places, like trying to acquire a lock. An IDE may be able to do this as well. See https://devguide.python.org/gdb/
- Try the semaphore-free executor: ocrmypdf --plugin ocrmypdf.extra_plugins.semfree input.pdf output.pdf. This is less efficient, but it is used in AWS Lambda and Android Termux, which don't have standard POSIX semaphores.
- Try threads instead of processes: ocrmypdf --use-threads ...
(If you drive OCRmyPDF from Python rather than the CLI, a rough API equivalent of the last two suggestions is sketched below.)
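A minimal sketch of the same two suggestions through the Python API, assuming the keyword arguments mirror the CLI flags (plugins and use_threads); the exact argument names may differ between versions:

```python
import ocrmypdf

# Assumed equivalent of:
#   ocrmypdf --plugin ocrmypdf.extra_plugins.semfree input.pdf output.pdf
ocrmypdf.ocr("input.pdf", "output.pdf", plugins=["ocrmypdf.extra_plugins.semfree"])

# Assumed equivalent of:
#   ocrmypdf --use-threads input.pdf output.pdf
ocrmypdf.ocr("input.pdf", "output.pdf", use_threads=True)
```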
Running the containerized version on a different server on the same dataset still hangs, but with much less frequency.
I have performed the test with the semaphore-free executor and it completed the data set without hanging. Qualitatively, the average time seems to be much lower, but some files took minutes to report that there was already OCR text.
I will try the debugger to see what I can find.
There does not seem to be a python-debuginfo package for Fedora 33. The devguide link says "When CPython is built you will notice a python-gdb.py file in the root directory of your checkout." The python-gdb.py file is needed to use gdb with Python.
OK, I got the Python 3.9 debugging information. There is only one hung process:
Attaching to process 163714
[New LWP 163720]
[New LWP 163780]
[New LWP 163794]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f135045c0d4 in do_futex_wait.constprop () from /lib64/libpthread.so.0
Missing separate debuginfos, use: dnf debuginfo-install bzip2-libs-1.0.8-4.fc33.x86_64 expat-2.4.1-1.fc33.x86_64 fribidi-devel-1.0.10-3.fc33.x86_64 giflib-5.2.1-6.fc33.x86_64 glibc-2.32-10.fc33.x86_64 jbigkit-libs-2.1-19.fc33.x86_64 leptonica-1.81.1-1.fc33.x86_64 libgcc-10.3.1-1.fc33.x86_64 libjpeg-turbo-2.0.5-5.fc33.x86_64 libpng-1.6.37-6.fc33.x86_64 libstdc++-10.3.1-1.fc33.x86_64 libtiff-4.1.0-8.fc33.x86_64 libuuid-2.36.1-1.fc33.x86_64 libwebp-1.2.0-1.fc33.x86_64 libzstd-1.5.0-1.fc33.x86_64 openssl-libs-1.1.1k-1.fc33.x86_64 xz-libs-5.2.5-4.fc33.x86_64 zlib-1.2.11-23.fc33.x86_64
(gdb) py-bt
Traceback (most recent call first):
File "/usr/lib64/python3.9/threading.py", line 1448, in _shutdown
lock.acquire()
(gdb)
This is the same traceback shown when ctrl-c is pressed.
I think this happens if the Linux OOM killer kills one of the worker child processes.
This is probably/hopefully resolved as of v13.1. ocrmypdf now recognizes if one of its child workers has crashed or been killed by OOM. Previously this kind of error handling only worked if a Python exception was thrown, so certain errors would be missed.
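As an illustration of why a killed worker can look like a hang, here is a minimal sketch (not OCRmyPDF's code): with a plain multiprocessing.Pool, the result of a task whose worker was SIGKILLed is never delivered, so the parent can wait forever.

```python
import multiprocessing as mp
import os
import signal

def task(n):
    if n == 3:
        # Simulate the OOM killer terminating this worker mid-task.
        os.kill(os.getpid(), signal.SIGKILL)
    return n * n

if __name__ == "__main__":
    with mp.Pool(2) as pool:
        # The result for n == 3 never arrives; on many Python versions this
        # map() blocks indefinitely instead of raising an error.
        print(pool.map(task, range(6)))
```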
I am writing to report that the problem described by OP has not been fixed in OCRmyPDF 13.3.0 (with Python 3.9.10). However, I also want to note that the problem can be avoided with the "--use-threads" argument as suggested by @jbarlow83 on Aug 6, 2021. (See also: https://github.com/ocrmypdf/OCRmyPDF/issues/800) I think it would be great if the problem could be fixed within OCRmyPDF. However, if fixing this problem is not a priority, I think the "--use-threads" argument should be added to the documentation. The argument is not currently mentioned in the documentation, and it is not ideal that the workaround is mentioned only in comments on issue reports.
More specifically, the problem comes up when the argument "--skip-text" or "--redo-ocr" is used on a file that had prior OCR text. OCRmyPDF hangs on the OCR step in either case. It does not show up for a PDF file made entirely of scanned images with no prior OCR. It does not make sense that OCRmyPDF hangs on the OCR step when it says it is going to "skip all processing" on every single page with the "--skip-text" argument, as seen in the following example, but it does.
$ ocrmypdf --skip-text good_PDF_2.pdf output.pdf
Scanning contents: 100%|████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 180.08page/s]
Start processing 15 pages concurrently
1 skipping all processing on this page
2 skipping all processing on this page
3 skipping all processing on this page
4 skipping all processing on this page
5 skipping all processing on this page
6 skipping all processing on this page
8 skipping all processing on this page
7 skipping all processing on this page
10 skipping all processing on this page
9 skipping all processing on this page
12 skipping all processing on this page
11 skipping all processing on this page
13 skipping all processing on this page
14 skipping all processing on this page
15 skipping all processing on this page
OCR: 100%|████████████████████████████████████████████████████████████| 15.0/15.0 [00:00<00:00, 249.21page/s]
OCR: 100%|████████████████████████████████████████████████████████████| 15.0/15.0 [00:03<00:00, 4.63page/s]
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/----/miniconda3/envs/-----/lib/python3.9/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
But, as already stated, the "--use-threads" argument is a workaround to the issue.
$ ocrmypdf --skip-text --use-threads good_PDF_2.pdf output.pdf
Scanning contents: 100%|████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 92.11page/s]
Start processing 15 pages concurrently
1 skipping all processing on this page
2 skipping all processing on this page
3 skipping all processing on this page
4 skipping all processing on this page
5 skipping all processing on this page
6 skipping all processing on this page
7 skipping all processing on this page
8 skipping all processing on this page
9 skipping all processing on this page
10 skipping all processing on this page
11 skipping all processing on this page
12 skipping all processing on this page
13 skipping all processing on this page
14 skipping all processing on this page
15 skipping all processing on this page
OCR: 100%|████████████████████████████████████████████████████████████| 15.0/15.0 [00:00<00:00, 1121.17page/s]
Postprocessing...
PDF/A conversion: 100%|████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 32.30page/s]
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.1%
Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)
> However, if fixing this problem is not a priority,
I don't have a way to reproduce this, so it's not easy to investigate; not that this sort of issue is easy to investigate in the first place. Python 3.10 may be worth a shot, if available, in the event it's a Python standard library issue. --use-threads gives much slower performance.
@jbarlow83 Thanks! I will give Python 3.10 a try at a later time. It appears that OP and I and the poster of https://github.com/ocrmypdf/OCRmyPDF/issues/800 all use virtual environments. Maybe this is all a virtual environment + multiprocessing problem, or possibly even something specific to Anaconda/Miniconda?
Anyway, slower is better than nonfunctional. It may be worth mentioning --use-threads in the documentation as a potential troubleshooting setting when OCRmyPDF is used on a range of PDF files whose scan/"born native" characteristics are not known ahead of time. That way, users will not question their sanity like I did when OCRmyPDF works for some files but not others.
For myself, OCRmyPDF is worth using even if I have to use it at a slower speed. I hope this problem can be fixed in time.
Just ran into this bug w/ ocrmypdf 13.4.0. Happens in both python 3.9 and 3.10. I have 12 cores on my machine; ocrmypdf completes successfully w/ -j 1 up to 7 or 8, but fails w/ -j 8 or 9 or more. (It also completes successfully if I add --use-threads, even w/, say, -j 12.)
The offending pdf is a 31-page federal court opinion. Possibly related: the file apparently already has OCR text, but search didn't work, hence my need to use ocrmypdf.
$ ocrmypdf federal.court.opinion.pdf federal.court.opinion2.pdf --redo-ocr -j 12
Scanning contents: 100%|| 31/31 [00:00<00:00, 72.11page/s]
Start processing 9 pages concurrently
1 redoing OCR
2 redoing OCR
...
31 redoing OCR
OCR: 98%|| 30.5/31.0 [00:28<00:00, 1.64page/s]
OCR: 100%|| 31.0/31.0 [06:26<00:00, 12.48s/page]
KeyboardInterrupt
^CException ignored in: <module 'threading' from '/usr/lib/python3.10/threading.py'>
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1525, in _shutdown
atexit_call()
File "/usr/lib/python3.10/concurrent/futures/process.py", line 95, in _python_exit
t.join()
File "/usr/lib/python3.10/threading.py", line 1089, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.10/threading.py", line 1105, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
KeyboardInterrupt:
^CException ignored in atexit callback: <function _exit_function at 0x7fe59433b9a0>
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
p.join()
File "/usr/lib/python3.10/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt:
@edhalter Thank you for posting. I was the previous poster and would like this problem fixed. I am glad you added to this thread.
My earlier post was created using a computing cluster. I do not know how many cores there are on the cluster except that there are a lot of them.
I tested my original 15 page PDF on my cluster:
With no "-j" setting, it processes 15 pages concurrently but hangs. With no "-j" settings and "--use-threads", all 15 pages are processed concurrently and successfully. With "-j X" settings, X pages are processed concurrently. It succeeds with X <= 5, otherwise, it hangs.
Using your 31 page PDF file on my cluster:
With no "-j" setting, it processes 31 pages concurrently but hangs. With no "-j" settings and "--use-threads", all 31 pages are processed concurrently and successfully. With "-j X" settings, X pages are processed concurrently. It succeeds with X <= 5, otherwise, it hangs.
The results are essentially identical using my file and your file.
With my cluster I am able to process more pages concurrently than you can, but my maximum successful "-j" setting is lower than yours, at 5 instead of 8.
So I can confirm that throttling using the "-j" flag can avoid the hang. However, this is clearly not a fix to the problem.
First, v13.1 does fix at least some versions of this issue, since a failed child process (which can be simulated easily) would definitely hang ocrmypdf, and that is fixed now. What users have reported since is probably something different, but the behavior looks the same to a user, so we'll deal with it here.
@edhalter's file seems to be a consistent reproducer of the issue post-13.1, which I very much needed.
The stack trace everyone is posting seems to be https://bugs.python.org/issue45274, for which there are partial fixes that are not yet available in many people's CPython builds. That stack trace is also likely unrelated to the issue. Rather, when the issue occurs, a side effect is that Ctrl+C cannot be processed normally, and Python issue 45274 also occurs. It's collateral damage that gives no clue as to what is actually happening. It also means a debugger cannot step in normally, since debuggers send SIGINT, which is the same as pressing Ctrl+C. In short, this has been difficult to explore.
Tests:
- Explored the possibility that tls.pageno (used to ensure page numbers are logged correctly in error messages) is involved, since it uses thread local storage. Issue still occurs.
- listener.join(timeout=5) - issue still occurs.
- --no-progress-bar - issue still occurs, but seems less frequent.
- -q - issue not reproducible in 10 attempts.
- Bypass logger.handle(record) (suppress all log messages sent from child processes) - not reproducible.
- Replace log_listener with logging.handlers.QueueListener - issue seems to occur almost 100% of the time.
- Remove with pool_lock - still occurs.
- Ensure listener.stop is called after executor.exit with try/finally - still occurs (probably keep this change, seems better).
- Put sentinel before executor.exit - causes BrokenPipeError from somewhere.
- Explicit shutdown before executor.exit - defers deadlock to end.
Getting pretty deep into Python internals.....
Note: -j12 aggravates the issue; it seems related to a worker process ending up without any work to do.
- Try TqdmConsole lock around write, flush - deadlocks.
- Try with logging_redirect_tqdm(): result = run_pipeline(...) and disable TqdmConsole, letting the tqdm API handle messaging - deadlocks.
- Only attempt listener.join() if executor raises no exception - deadlocks.
- Try q.get(timeout=2) - deadlocks.
- Try q.get(timeout=2) and break on empty queue - deadlocks (empty queue condition occurs repeatedly).
I believe this is an instance of a very general multiprocessing issue in Python: forking a process that has already created threads can deadlock. OCRmyPDF now uses forkserver to avoid this situation entirely, and prefers threads by default, so the issue should be resolved.
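A minimal sketch of the mechanism being described (illustrative only, not OCRmyPDF's code): a background thread holds a lock while worker processes start. Workers started with the "fork" start method can inherit a copy of that lock in a permanently locked state, whereas "forkserver" workers begin from a clean process and cannot.

```python
import multiprocessing as mp
import threading
import time

lock = threading.Lock()

def background():
    with lock:
        time.sleep(60)  # hold the lock while worker processes start

def task(_):
    with lock:          # workers need the lock too
        return True

if __name__ == "__main__":
    threading.Thread(target=background, daemon=True).start()
    time.sleep(0.2)  # ensure the thread owns the lock before workers start

    # With the "fork" start method, each worker would inherit a *copy* of
    # `lock` in the locked state but not the thread that will release it,
    # so task() could block forever. "forkserver" (Unix only) starts
    # workers from a clean process, so the inherited-lock deadlock cannot
    # happen.
    ctx = mp.get_context("forkserver")
    with ctx.Pool(2) as pool:
        print(pool.map(task, range(4)))
```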