core
core copied to clipboard
Get not available ocrd_tool in ocrd_utils/os.py
The function get_ocrd_tool_json(executable) tries to load the ocr-d tool with a processor call. If the function is called with a non available processor (typo for example) then it is imo too hard to distinguish that from a successful call. Because by default the returned dict is filled with some default values, but I would expect it to be empty when the processor is not available. So I would move the call to extend the ocrd_tool-dict inside the try block and return an empty dict if the processor is not available. But I don't know if that can cause other problems.
https://github.com/OCR-D/core/blob/ee92cfc084b8e4f2dd5be2f7016f957c87ae1573/ocrd_utils/ocrd_utils/os.py#L73-L85
Yes, that's a problem IMO. I don't recall exactly why we decided to catch this error in the first place. (It could have been that some processors misbehaved, and we did not want them to spoil the overall result of a ocrd resmgr list-available "*" for example. Also, we cannot be sure whether every ocrd-* program in PATH actually is a processor CLI...)
We currently use the function in
cli.resmgr.download– where we require theresource_locationsentry, so the proposed change would currently break (i.e. a misbehaving or non-processor would trigger an uncaught exception, preventing the command from running through) https://github.com/OCR-D/core/blob/ee92cfc084b8e4f2dd5be2f7016f957c87ae1573/ocrd/ocrd/cli/resmgr.py#L142-L146OcrdResourceManager.list_available– (for the case of dynamic discovery based on the registered resources decentralised via tool json) where we default to an empty result, so nothing would break https://github.com/OCR-D/core/blob/ee92cfc084b8e4f2dd5be2f7016f957c87ae1573/ocrd/ocrd/resource_manager.py#L110-L112ProcessorTask.validate– (so far used only for prior syntax checking of workflows inocrd process) where an empty tool json simply results in an empty parameter specification, having the validator deny any runtime parameters, so again nothing would break https://github.com/OCR-D/core/blob/ee92cfc084b8e4f2dd5be2f7016f957c87ae1573/ocrd/ocrd/task_sequence.py#L71-L72
Judging by that I'd say yes, we should change as you proposed, but then need to handle the case graciously when ocrd resmgr download "*" finds a ocrd-* that misbehaves. (For example, by rewriting the dict lookup as a .get('resource_locations', RESOURCE_LOCATIONS) – allowing every location and preferring data.)
At first I wanted to use get_ocrd_tool_json to verify that a processor is available. Then I created this issue. Later I realized that this doesn't fit my needs anyway but I didn't want/forgot to close/remove the issue again.
I wondered if there is a reason why the dict-extension is not part of the try-block, I thought is just an easy fix. But your findings show me I should just have searched for the function invocation to see that the change could cause some errors. Now I wonder if the correction of this "feature" is worth the effort.
Now I wonder if the correction of this "feature" is worth the effort.
It's a minute change.
Also, we cannot be sure whether every ocrd-* program in PATH actually is a processor CLI...)
That's the main reason why we have this check, tools like ocrd-cis-data or other non-processors break it.
I agree that returning an empty dict for non-processor ocrd-* executables seems better. IIUC the only change in the function would be to not set resource_locations in that case. We'd have to adapt cli.resmgr to handle this though, by just defaulting to data for location.
If so, let's move ahead.
I agree that returning an empty dict for non-processor
ocrd-*executables seems better. IIUC the only change in the function would be to not setresource_locationsin that case. We'd have to adaptcli.resmgrto handle this though, by just defaulting todatafor location.
Yes, that's what I meant by …
(For example, by rewriting the dict lookup as a
.get('resource_locations', RESOURCE_LOCATIONS)– allowing every location and preferringdata.)