core adapt to decentralized resources

trafficstars

To implement https://github.com/OCR-D/spec/pull/181, AFAICS we need:

[x] code to mix the Processor.ocrd_tool['resources'] with the preinstalled centralized database in OcrdResourceManager's constructor and .load_resource_list
[ ] code to adhere to Processor.ocrd_tool['resource_locations'] in .resolve_resource and .list_all_resources
[x] changes to the ocrd_validators.resource_list_validator, specifically its resource_list.schema.yml (subtypes of type)
[x] changes to the ocrd_validators.ocrd_tool_validator, specifically its ocrd_tool.schema.yml by updating repo/spec

Did I forget anything?

EDIT

[x] aid in migrating user ocrd/resources.yml from old syntax to new one
[ ] update documentation (spec, guides, wiki)

Feb 14 '22 13:02 bertsky

Looks fairly complete for changes in core, thank you. I'll extend #800 for the latter three points, the first one is already in place (ocrd resmgr discover).

Feb 14 '22 13:02 kba

the first one is already in place (ocrd resmgr discover).

That's not quite what I meant though. I would expect the new OcrdResourceManager.discover to be called by OcrdResourceManager.__init__ (right before merging with the user database), so there would be no install-time need for things like ocrd resmgr discover.

Feb 14 '22 13:02 bertsky

That's not quite what I meant though. I would expect the new OcrdResourceManager.discover to be called by OcrdResourceManager.__init__ (right before merging with the user database), so there would be no install-time need for things like ocrd resmgr discover.

Doing this every time the OcrdResourceManager is instantiated is a huge performance penalty. E.g. I currently have about 70 processors installed. A lot of them don't separate __init__ and setup and it takes longer than a second to get the ocrd-tool.json, so this takes at least two minutes to run --dump-json on all the processors.

Feb 14 '22 13:02 kba

Doing this every time the OcrdResourceManager is instantiated is a huge performance penalty. E.g. I currently have about 70 processors installed. A lot of them don't separate __init__ and setup and it takes longer than a second to get the ocrd-tool.json, so this takes at least two minutes to run --dump-json on all the processors.

Good point. But the distinction between constructor and setup for processing will become the norm (and if done right, could also be used to keep heavy-toll dependencies like Tensorflow imports out of the non-processing contexts). And in the context of a processor (i.e. outside of ocrd.cli.resmgr), at least for the Pythonic ones, there is no need for a --dump-json, as one can directly pick it up from the pkg_resources.

And even in the resmgr CLI, 1s delay would still be tolerable, so it does not hurt if you query a single processor.

So it is only ocrd resmgr list-available * case where things become unacceptablely slow. But is that worth sacrificing no extra post-installation step necessary?

Feb 14 '22 14:02 bertsky

And even in the resmgr CLI, 1s delay would still be tolerable, so it does not hurt if you query a single processor.

I've tested it and including all the spurious results like ocrd-make, ocrd-import etc., ocrd resmgr discover takes at most 90 seconds for a fairly complete ocrd_all installation. Since we can model the code in such a way that it is conservative with what is searched.

So it is only ocrd resmgr list-available * case where things become unacceptablely slow. But is that worth sacrificing no extra post-installation step necessary?

If we do it completely dynamically then yes, that would be the only case. And we save ourselves future headaches, if we don't try to serialize the resource descriptions to resource_list.yml (deduplication, updated processors etc. it's likely a nightmare).

Feb 14 '22 15:02 kba

I took the liberty of checking tasks that have been solved by now above.

code to mix the Processor.ocrd_tool['resources'] with the preinstalled centralized database in OcrdResourceManager's constructor and .load_resource_list

seems complete: https://github.com/OCR-D/core/blob/ceb9992ab113f1b3933b2af33819ceeef0f085b7/ocrd/ocrd/resource_manager.py#L105-L115

code to adhere to Processor.ocrd_tool['resource_locations'] in .resolve_resource and .list_all_resources

resolve_resource / list_resource_candidates does not adhere to resource_locations yet: https://github.com/OCR-D/core/blob/ceb9992ab113f1b3933b2af33819ceeef0f085b7/ocrd_utils/ocrd_utils/os.py#L97-L107

Also, as discussed in ocrd_tesserocr, we still need to handle resource location module correctly.

list_all_resources seems complete though: https://github.com/OCR-D/core/blob/ceb9992ab113f1b3933b2af33819ceeef0f085b7/ocrd_utils/ocrd_utils/os.py#L115-L120

changes to the ocrd_validators.resource_list_validator, specifically its resource_list.schema.yml (subtypes of type)

seems complete: https://github.com/OCR-D/core/blob/ceb9992ab113f1b3933b2af33819ceeef0f085b7/ocrd_validators/ocrd_validators/constants.py#L22-L28 and https://github.com/OCR-D/core/blob/master/ocrd_validators/ocrd_validators/resource_list_validator.py and https://github.com/OCR-D/core/blob/6196e3bb6cb78f4f55d65228289425b344ab0483/ocrd/ocrd/resource_manager.py#L88

changes to the ocrd_validators.ocrd_tool_validator, specifically its ocrd_tool.schema.yml by updating repo/spec

seems complete: https://github.com/OCR-D/core/blob/ceb9992ab113f1b3933b2af33819ceeef0f085b7/ocrd_validators/ocrd_validators/ocrd_tool.schema.yml#L168-L209

aid in migrating user ocrd/resources.yml from old syntax to new one

seems complete: https://github.com/OCR-D/core/blob/4148a88ea64de56114cb6ce24fcee216d563c61b/ocrd/ocrd/cli/resmgr.py#L161-L195

update documentation (spec, guides, wiki)

already tracked by https://github.com/OCR-D/spec/issues/193

Not sure about the guides and Wiki yet. I did find the section about resmgr download syntax to be out of date though.

Aug 17 '22 13:08 bertsky

core core copied to clipboard

adapt to decentralized resources

core
core copied to clipboard