[Bug/Feature]: a way to disable Ghostscript requirement & broken plugin_manager option
What were you trying to do?
I made a plugin that overrides rasterize_pdf_page and generate_pdfa, and it works great. However, when I try to remove ghostscript from the system, ocrmypdf tells me No such file or directory: 'gs' when validating hooks.
File "/opt/assemble/atticus/current/jobs/image-to-pdf/venv/lib/python3.11/site-packages/ocrmypdf/subprocess/__init__.py", line 159, in get_version
proc = run(
^^^^
File "/opt/assemble/atticus/current/jobs/image-to-pdf/venv/lib/python3.11/site-packages/ocrmypdf/subprocess/__init__.py", line 63, in run
proc = subprocess_run(args, env=env, check=check, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/subprocess.py", line 548, in run
with Popen(*popenargs, **kwargs) as process:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/subprocess.py", line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.11/subprocess.py", line 1955, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'gs'
Seeing that ghostscript.py is included in the default plugins, I tried using the plugin_manager option with builtins=False instead: plugin_manager=get_plugin_manager(plugins=[my_plugin_path], builtins=False). Now it fails at the options = create_options( step, because the plugin_manager object is not a number, string or path. Am I correct that the plugin_manager option is no longer functional?
cmdline.append(f"--{cmd_style_arg}")
if isinstance(val, int | float):
    cmdline.append(str(val))
elif isinstance(val, str):
    cmdline.append(val)
elif isinstance(val, Path):
    cmdline.append(str(val))
else:
    raise TypeError(f"{arg}: {val} ({type(val)})")
File "/opt/assemble/atticus/current/jobs/image-to-pdf/venv/lib/python3.11/site-packages/ocrmypdf/api.py", line 368, in ocr
options = create_options(
^^^^^^^^^^^^^^^
File "/opt/assemble/atticus/current/jobs/image-to-pdf/venv/lib/python3.11/site-packages/ocrmypdf/api.py", line 200, in create_options
cmdline, deferred = _kwargs_to_cmdline(
^^^^^^^^^^^^^^^^^^^
File "/opt/assemble/atticus/current/jobs/image-to-pdf/venv/lib/python3.11/site-packages/ocrmypdf/api.py", line 179, in _kwargs_to_cmdline
raise TypeError(f"{arg}: {val} ({type(val)})")
TypeError: plugin_manager: <ocrmypdf._plugin_manager.OcrmypdfPluginManager object at 0xffffa083cd90> (<class 'ocrmypdf._plugin_manager.OcrmypdfPluginManager'>)
Sorry about the 2-in-1 issue, they're quite connected in this case.
Where are you installing/running from?
PyPI (pip, poetry, pipx, etc.)
OCRmyPDF version
15.4.3
What operating system are you working on?
Linux
Operating system details and version
Ubuntu 20.04
Simple sanity checks
- [X] Operating system is currently supported by its vendor (not end of life)
- [X] Python version is compatible with OCRmyPDF
- [X] This issue is not about a specific input file
Relevant log output
No response
It seems that the simplest fix would be to add plugin_manager to the defer_kwargs list here, though I'm not sure that's exactly the purpose of defer_kwargs. I could make a pr, if that is indeed the solution.
Or alternatively, a disable_plugins: Iterable[str] | None = None, option.
It should work without source changes if you do
plugin_manager = get_plugin_manager(plugins=['ocrmypdf.builtin_plugins.concurrency', ... all builtins except ocrmypdf.builtin_plugins.ghostscript..., 'yourcustomghostscriptreplacer'], builtins=False)
yourcustomghostscriptreplacer.py would need to @hookimpl every API that builtin_plugins/ghostscript.py does.
To convince yourself that this works, clone the source and rewrite builtin_plugins/ghostscript.py to use your PDF renderer instead of Ghostscript.
The reason for this is that the builtin ghostscript plugin hooks check_options and tests for the existence of the gs binary. So you need to disable all default plugins to prevent this check_options hook from being installed, and then install all the ordinary plugins manually.
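A minimal sketch of building that plugin list, for illustration only. Only the concurrency and ghostscript module names appear in this thread; the other entries in ASSUMED_BUILTINS are assumptions and should be checked against the installed OCRmyPDF version. The helper plugins_without is hypothetical and is not part of OCRmyPDF.

```python
# Assumed list of built-in plugin modules. Only "concurrency" and
# "ghostscript" are named in this thread; verify the rest against the
# installed OCRmyPDF version.
ASSUMED_BUILTINS = [
    "ocrmypdf.builtin_plugins.concurrency",
    "ocrmypdf.builtin_plugins.default_filters",
    "ocrmypdf.builtin_plugins.ghostscript",
    "ocrmypdf.builtin_plugins.optimize",
    "ocrmypdf.builtin_plugins.tesseract_ocr",
]


def plugins_without(builtins, excluded, replacements):
    """Return `builtins` minus `excluded`, followed by `replacements`."""
    excluded_set = set(excluded)
    kept = [name for name in builtins if name not in excluded_set]
    return kept + list(replacements)


plugin_list = plugins_without(
    ASSUMED_BUILTINS,
    excluded=["ocrmypdf.builtin_plugins.ghostscript"],
    replacements=["yourcustomghostscriptreplacer"],
)
print(plugin_list)

# The result would then be passed as described above:
# get_plugin_manager(plugins=plugin_list, builtins=False)
```

Keeping the exclusion explicit makes it easier to notice when a newer release adds or renames a built-in plugin.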
How do you get around the TypeError: plugin_manager: <ocrmypdf._plugin_manager.OcrmypdfPluginManager object at 0xffff8a341fd0> (<class 'ocrmypdf._plugin_manager.OcrmypdfPluginManager'>) error? As I mentioned in the original post, it seems that plugin_manager option cannot be used at all right now, due to the check requiring most options to be a number, string or path.
Also, am I correct that use_threads=False does not affect rasterize_pdf_page? (That's the impression I get from local tests, and from looking at the code, just wanted to confirm) And there's no way to easily re-use existing Executor logic for this?
Context: using pdfium for rendering, which is explicitly not thread-safe. Trying to see if there's a better approach than having a global pdfium_lock = threading.Lock().
use_threads affects which executor and which type of worker (thread or process). (For various reasons, it's never made sense to get rid of either type.) Then the worker calls rasterize_pdf_page. So you would need a threading lock (regardless of worker type; in the case of process the lock is just never contested).
If a plugin can't run under some configuration (including the setting of use_threads) it should hook check_options and raise an exception to say "plugin can't do that".
Long term I am considering converting ocrmypdf to rust, although I can't promise any kind of timeline, but moving to rust would mean being able to include libraries like pdfium with safe concurrency.
Thank you. With regards to the plugin_manager option triggering a TypeError, should I make a pr to fix it?
On use_threads, I may be wrong, but page_context.plugin_manager.hook.rasterize_pdf_page is always called from the main process, no? E.g. if I print os.getpid() and multiprocessing.current_process().name inside the hook, I always get the same process number and MainProcess as process name, regardless of use_threads=False.
I thought the comment about "in the case of process the lock is just never contested" above suggested otherwise, though I may be wrong.
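A small diagnostic along these lines can tell thread workers apart from process workers. This is a sketch using only the standard library; describe_worker is a hypothetical helper that could be called from inside a hook.

```python
import multiprocessing
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def describe_worker():
    """Report which process and thread the caller is running in."""
    return {
        "pid": os.getpid(),
        "process": multiprocessing.current_process().name,
        "thread": threading.current_thread().name,
    }


from_caller = describe_worker()

# A thread worker shares the caller's pid but has a different thread name.
with ThreadPoolExecutor(max_workers=1) as pool:
    from_thread = pool.submit(describe_worker).result()

print(from_caller)
print(from_thread)
```

If the pid and process name never change but the thread name does, the hook is being run by thread workers rather than process workers.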
It seems something overrides use_threads. I.e. use_threads=False when calling ocrmypdf.ocr becomes use_threads=True inside plugin's check_options. Stack trace also suggests it, since it begins in threading.py. (Doing print(''.join(traceback.format_stack())) inside rasterize_pdf_page)
I'm aware use_threads gets overridden in info.py, but those conditions don't seem to apply. (len(pages) is 10+, and available_cpu_count() is 10) I also tried specifying the jobs option, since it seems to be the thing that initialises max_workers in the link above, still no luck.
I'll try to run it in a debugger when I get a moment, just wanted to post an update.
The plugin manager error isn't an error. It's misuse of my admittedly underdocumented spec. You can just call ocrmypdf.ocr(plugins=['your plugin']). For what you're doing you probably don't need to construct a plugin manager - the main reason that's there is for testing (dependency injection). See tests/conftest.py check_ocrmypdf() for test usage.
It is true that the use_threads directive is overridden in certain cases (info.py), but that's a local decision.
use_threads
I had a deeper look: use_threads=False is currently ignored when using the Python API, ocrmypdf.ocr(... use_threads=False, ...). That's because the _kwargs_to_cmdline function treats all boolean arguments as if false is the default:
if isinstance(val, bool):
    if val:
        cmdline.append(f"--{cmd_style_arg}")
    continue
But for use_threads, true is the default:
jobcontrol.add_argument(
    '--use-threads', action='store_true', default=True, help=argparse.SUPPRESS
)
Am I correct that there is currently no way to set use_threads to false in the Python API? (As opposed to the command line API)
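The interaction can be reproduced in isolation with argparse. This is a simplified sketch, not the OCRmyPDF source: bool_kwargs_to_cmdline mimics only the boolean branch quoted above, and the underscore-to-hyphen conversion is an assumption about how cmd_style_arg is derived.

```python
import argparse


def bool_kwargs_to_cmdline(**kwargs):
    """Simplified sketch of the boolean branch quoted above."""
    cmdline = []
    for arg, val in kwargs.items():
        # Assumption: option names are the kwarg names with hyphens.
        cmd_style_arg = arg.replace("_", "-")
        if isinstance(val, bool):
            if val:
                cmdline.append(f"--{cmd_style_arg}")
            # A False value emits nothing, so the parser default wins.
            continue
    return cmdline


parser = argparse.ArgumentParser()
parser.add_argument("--use-threads", action="store_true", default=True)

cmdline = bool_kwargs_to_cmdline(use_threads=False)
options = parser.parse_args(cmdline)

print(cmdline)
print(options.use_threads)
```

Because False produces no flag and the flag's default is True, the parsed value stays True no matter what the caller passed.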
disabling the ghostscript check
The plugin manager error isn't an error. It's misuse of my admittedly underdocumented spec. You can just call ocrmypdf.ocr(plugins=['your plugin']).
In that case, am I correct that there is no way to remove the ghostscript plugin, and hence no way to remove the ghostscript requirement? Even if you're not using ghostscript for anything?
(Happy to make a pr for both of those, if you're open to changing the logic)
You can use --no-use-threads to override that setting.
For ghostscript, compare how the test suite replaces ghostscript with test stubs. It is definitely replaceable.
What were you trying to do?
I made a plugin that overrides rasterize_pdf_page and generate_pdfa, and it works great. However, when I try to remove ghostscript from the system, ocrmypdf tells me No such file or directory: 'gs' when validating hooks.
This sounds interesting. @nikitar, do you happen to want to share your plugin? What did you use instead? Replacing ghostscript would be nice.
moving to rust would mean being able to include libraries like pdfium with safe concurrency.
Why do you connect this with moving to Rust? Locking pdfium behind a mutex would be possible in any language, including Python.
However, I'm not sure locking in an across-the-board way (as pdfium-render seems to do) is a good design, as this may cause overhead for the frequently-called APIs. Locking selectively on the caller side might be cleaner. (However, pypdfium2 might yet need a way to plug a mutex into its finalizer machinery.) If you want to actually parallelize pdfium calls, you will have to use processes rather than threads in any case.
I didn't realize that pdfium had these concurrency limitations in its C++ codebase when I wrote that. As you say, processes are the only way to parallelize pdfium.
I like the idea of moving to Rust because I could deliver a single binary that incorporates Tesseract (there's also some rust-native OCR), pdfium, and most other dependencies that ocrmypdf currently needs installed separately. That's something that seems like a win for users that I can't get with Python. The main missing piece is a Ghostscript-free PDF/A converter, and Rust friendly JBIG2.
I see. Technically, I assume one could also create packages bundling the dependencies with ocrmypdf's Python code (or maybe use something like PyInstaller), but I agree a single binary would be nicer. For one thing, end users wouldn't need to install Python itself. Rust would also be faster than Python (though, depending on what you do, this may or may not be relevant for overall performance). Avoiding the need to call external programs via subprocess and instead using libraries would be advantageous, too.
However, rewriting ocrmypdf in Rust sounds like a huge amount of work, and much energy has been invested in the existing mature Python codebase + dependencies. Also, what would happen to pikepdf (and its dependents) if you do? You are probably the best open-source Python PDF committer, and it would be somewhat sad if you moved away from the Python ecosystem.
Fixed by #1555