core
adapt to decentralized resources
To implement https://github.com/OCR-D/spec/pull/181, AFAICS we need:
- [x] code to mix the Processor.ocrd_tool['resources'] with the preinstalled centralized database in OcrdResourceManager's constructor and .load_resource_list
- [ ] code to adhere to Processor.ocrd_tool['resource_locations'] in .resolve_resource and .list_all_resources
- [x] changes to the ocrd_validators.resource_list_validator, specifically its resource_list.schema.yml (subtypes of type)
- [x] changes to the ocrd_validators.ocrd_tool_validator, specifically its ocrd_tool.schema.yml by updating repo/spec
Did I forget anything?
EDIT
- [x] aid in migrating user ocrd/resources.yml from old syntax to new one
- [ ] update documentation (spec, guides, wiki)
Looks fairly complete for changes in core, thank you. I'll extend #800 for the latter three points, the first one is already in place (ocrd resmgr discover).
> the first one is already in place (ocrd resmgr discover).
That's not quite what I meant though. I would expect the new OcrdResourceManager.discover to be called by OcrdResourceManager.__init__ (right before merging with the user database), so there would be no install-time need for things like ocrd resmgr discover.
> That's not quite what I meant though. I would expect the new OcrdResourceManager.discover to be called by OcrdResourceManager.__init__ (right before merging with the user database), so there would be no install-time need for things like ocrd resmgr discover.
Doing this every time the OcrdResourceManager is instantiated is a huge performance penalty. E.g. I currently have about 70 processors installed. A lot of them don't separate __init__ and setup and it takes longer than a second to get the ocrd-tool.json, so this takes at least two minutes to run --dump-json on all the processors.
> Doing this every time the OcrdResourceManager is instantiated is a huge performance penalty. E.g. I currently have about 70 processors installed. A lot of them don't separate __init__ and setup and it takes longer than a second to get the ocrd-tool.json, so this takes at least two minutes to run --dump-json on all the processors.
Good point. But the distinction between constructor and setup for processing will become the norm (and if done right, could also be used to keep heavyweight dependencies like TensorFlow imports out of the non-processing contexts). And in the context of a processor (i.e. outside of ocrd.cli.resmgr), at least for the Pythonic ones, there is no need for a --dump-json, as one can directly pick it up from pkg_resources.
And even in the resmgr CLI, 1s delay would still be tolerable, so it does not hurt if you query a single processor.
So it is only the ocrd resmgr list-available * case where things become unacceptably slow. But is that worth sacrificing the benefit of needing no extra post-installation step?
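The point above about Pythonic processors not needing --dump-json can be sketched roughly as follows. This is only an illustration: it uses the standard library's importlib.resources instead of pkg_resources, and the helper names `load_ocrd_tool` and `tool_resources`, as well as the assumption that the package ships its ocrd-tool.json as package data under that filename, are hypothetical.

```python
import json
from importlib.resources import files  # Python 3.9+


def load_ocrd_tool(package, filename="ocrd-tool.json"):
    """Read a tool description straight from an installed package's data.

    Hypothetical helper: assumes `package` ships `filename` as package data.
    No subprocess is started and the processor itself is never instantiated.
    """
    text = files(package).joinpath(filename).read_text(encoding="utf-8")
    return json.loads(text)


def tool_resources(ocrd_tool, executable):
    """Return the `resources` declared for one executable (empty list if none)."""
    return ocrd_tool.get("tools", {}).get(executable, {}).get("resources", [])
```

Since nothing heavy gets imported this way, the cost per processor should be closer to a file read than to the 1s startup mentioned above, though that has not been measured here.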
> And even in the resmgr CLI, 1s delay would still be tolerable, so it does not hurt if you query a single processor.
I've tested it: including all the spurious results like ocrd-make, ocrd-import etc., ocrd resmgr discover takes at most 90 seconds for a fairly complete ocrd_all installation, and we can model the code in such a way that it is conservative with what is searched.
> So it is only the ocrd resmgr list-available * case where things become unacceptably slow. But is that worth sacrificing the benefit of needing no extra post-installation step?
If we do it completely dynamically then yes, that would be the only case. And we save ourselves future headaches if we don't try to serialize the resource descriptions to resource_list.yml (deduplication, updated processors etc.; it's likely a nightmare).
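A fully dynamic but conservative discovery, as discussed above, might look roughly like this sketch. The function name `discover_resources`, the skip list and the timeout are assumptions for illustration, not the code in core; the only thing taken from the thread is that processors answer --dump-json with their tool description and that names like ocrd-make or ocrd-import are spurious matches.

```python
import json
import os
import subprocess
from pathlib import Path


def discover_resources(path_dirs, skip=("ocrd-make", "ocrd-import"), timeout=10):
    """Collect `resources` from every `ocrd-*` executable answering --dump-json.

    Hypothetical sketch: nothing is written to disk, so there is no
    serialized list that could go stale when processors are updated.
    """
    found = {}
    for directory in path_dirs:
        for exe in sorted(Path(directory).glob("ocrd-*")):
            name = exe.name
            # Be conservative: skip known non-processors and shadowed names
            # (the first directory wins, as in a shell PATH lookup).
            if name in skip or name in found:
                continue
            if not (exe.is_file() and os.access(exe, os.X_OK)):
                continue
            try:
                proc = subprocess.run(
                    [str(exe), "--dump-json"],
                    capture_output=True, text=True, timeout=timeout, check=True,
                )
                tool = json.loads(proc.stdout)
            except (OSError, subprocess.SubprocessError, ValueError):
                continue  # not a processor, too slow, or no valid JSON
            if isinstance(tool, dict):
                found[name] = tool.get("resources", [])
    return found
```

It would typically be called with the entries of the PATH environment variable, e.g. `discover_resources(os.environ["PATH"].split(os.pathsep))`. Restricting it to a single executable keeps the single-processor query at about one subprocess call.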
I took the liberty of checking off the tasks above that have been solved by now.
- code to mix the Processor.ocrd_tool['resources'] with the preinstalled centralized database in OcrdResourceManager's constructor and .load_resource_list
seems complete: https://github.com/OCR-D/core/blob/ceb9992ab113f1b3933b2af33819ceeef0f085b7/ocrd/ocrd/resource_manager.py#L105-L115
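For readers who do not want to follow the link, the idea of mixing several resource databases can be sketched as below. This is not the linked implementation: the function name `merge_resource_lists` is hypothetical, and both the choice of `name` as the deduplication key and the rule that later databases override earlier ones are assumptions.

```python
def merge_resource_lists(*databases):
    """Merge {executable: [resource, ...]} mappings into one.

    Hypothetical sketch: resources are deduplicated per executable by their
    `name`; an entry from a later database replaces an earlier one.
    """
    merged = {}
    for database in databases:
        for executable, resources in database.items():
            by_name = merged.setdefault(executable, {})
            for resource in resources:
                by_name[resource["name"]] = resource
    return {exe: list(by_name.values()) for exe, by_name in merged.items()}
```

A call such as `merge_resource_lists(central, from_ocrd_tool, user)` would then let the user database take precedence over the processor's own description, which in turn overrides the centralized list.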
- code to adhere to Processor.ocrd_tool['resource_locations'] in .resolve_resource and .list_all_resources
resolve_resource / list_resource_candidates does not adhere to resource_locations yet: https://github.com/OCR-D/core/blob/ceb9992ab113f1b3933b2af33819ceeef0f085b7/ocrd_utils/ocrd_utils/os.py#L97-L107
Also, as discussed in ocrd_tesserocr, we still need to handle the resource location module correctly.
list_all_resources seems complete though: https://github.com/OCR-D/core/blob/ceb9992ab113f1b3933b2af33819ceeef0f085b7/ocrd_utils/ocrd_utils/os.py#L115-L120
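What adhering to resource_locations during resolution could mean is sketched below. Everything here is an assumption for illustration: the function name `candidate_paths`, the location names cwd, data, system and module, the directory layout under ocrd-resources, and the fixed search order are modeled loosely on the discussion and not taken from the linked code.

```python
import os
from pathlib import Path

# Assumed search order; only locations the processor declares are used.
LOCATION_ORDER = ("cwd", "data", "system", "module")


def candidate_paths(executable, fname, locations, cwd=None, moduledir=None):
    """List candidate paths for a resource, restricted to declared locations.

    Hypothetical sketch: `locations` stands for the processor's
    ocrd_tool['resource_locations']; `moduledir` is the directory of the
    processor's Python package, if it has one.
    """
    data_home = Path(os.environ.get("XDG_DATA_HOME") or Path.home() / ".local" / "share")
    known = {
        "cwd": Path(cwd or os.getcwd()) / fname,
        "data": data_home / "ocrd-resources" / executable / fname,
        "system": Path("/usr/local/share/ocrd-resources") / executable / fname,
    }
    if moduledir is not None:
        known["module"] = Path(moduledir) / fname
    return [known[loc] for loc in LOCATION_ORDER if loc in locations and loc in known]
```

With `locations=["module"]` and no module directory given, the result is empty, which is the case a caller would have to treat as an error rather than silently falling back to another location.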
- changes to the ocrd_validators.resource_list_validator, specifically its resource_list.schema.yml (subtypes of type)
seems complete: https://github.com/OCR-D/core/blob/ceb9992ab113f1b3933b2af33819ceeef0f085b7/ocrd_validators/ocrd_validators/constants.py#L22-L28 and https://github.com/OCR-D/core/blob/master/ocrd_validators/ocrd_validators/resource_list_validator.py and https://github.com/OCR-D/core/blob/6196e3bb6cb78f4f55d65228289425b344ab0483/ocrd/ocrd/resource_manager.py#L88
- changes to the ocrd_validators.ocrd_tool_validator, specifically its ocrd_tool.schema.yml by updating repo/spec
seems complete: https://github.com/OCR-D/core/blob/ceb9992ab113f1b3933b2af33819ceeef0f085b7/ocrd_validators/ocrd_validators/ocrd_tool.schema.yml#L168-L209
- aid in migrating user ocrd/resources.yml from old syntax to new one
seems complete: https://github.com/OCR-D/core/blob/4148a88ea64de56114cb6ce24fcee216d563c61b/ocrd/ocrd/cli/resmgr.py#L161-L195
- update documentation (spec, guides, wiki)
already tracked by https://github.com/OCR-D/spec/issues/193
Not sure about the guides and Wiki yet. I did find the section about resmgr download syntax to be out of date though.